PLSemanticsBench is the first benchmark for evaluating LLMs as programming language interpreters. It introduces three tasks:
| Task | Description |
|---|---|
| ✨ PredState | Predicts the final program state |
| ✨ PredRule | Predicts the ordered sequence of semantic rules needed to evaluate a program |
| ✨ PredTrace | Predicts the step-by-step execution of a program |
PLSemanticsBench is hosted on HuggingFace: [PLSemanticsBench](https://huggingface.co/datasets/EngineeringSoftware/PLSemanticsBench).
To evaluate your models, implement `BaseRunner` (the `_query` method). We provide two example implementations: `GPTRunner` for OpenAI models and `OllamaRunner` for Ollama models.
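For other backends, a custom runner can subclass `BaseRunner`. The sketch below is a minimal, hypothetical example: the exact `_query` signature is defined by the package, and here we simply assume it takes a prompt string and returns the raw model response (the `MyRunner` class and `call_my_model` helper are placeholders, not part of the library).

```python
from plsemanticsbench import BaseRunner


def call_my_model(prompt: str) -> str:
    """Placeholder for your own inference client (HTTP call, local model, ...)."""
    raise NotImplementedError


class MyRunner(BaseRunner):
    """Hypothetical runner that forwards prompts to a custom model backend."""

    def _query(self, prompt: str) -> str:
        # Assumed signature: receive the rendered prompt and return the raw
        # model completion as a string; check BaseRunner for the actual one.
        return call_my_model(prompt)
```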
- Conda package management system
- Python 3.11 or higher
- OpenAI API key (for running experiments with OpenAI models)
- Create and activate the conda environment:

  ```bash
  conda env create -f env.yaml
  conda activate plsemanticsbench
  ```

- Set up your OpenAI API key (only for OpenAI models):

  ```bash
  export OPENAI_API_KEY='your-api-key-here'
  ```
We provide a quick-start bash script that:

- Sets up the `plsemanticsbench` conda environment.
- Pulls the DeepSeek-R1 1.5B model.
- Evaluates the DeepSeek-R1 1.5B model on the `PredState` task with `no-semantics` and `chain-of-thought` prompting on the Human-Written dataset.
- Prints the `accuracy` and `malformed-count` to screen.
- Creates `metrics-predstate-deepseek-r1:1.5b.json`, which contains the evaluation result.
Run it with:

```bash
bash quick
```
Here's a minimal example to get started:
```python
from plsemanticsbench import GPTRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset,
)

# Model name
model_name = "o3-mini"

# Experiment args: run the PredState task on the IMP language with
# standard semantics formalized using SOS and with direct prompting
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name=model_name,
    prompt_strategy=PROMPT_STRATEGY.DA,
    num_datapoints_to_run=2,  # Run just 2 datapoints (omit to run the entire dataset)
)

# Run inference using the OpenAI API
gpt_runner = GPTRunner(args=exp_args)

# Generation (generate LLM predictions on the PredState task)
predictions = gpt_runner.do_experiment()  # a path to dump results can be provided

# Evaluation (evaluate LLM predictions against the ground truth)
llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
evaluation_result = llm_eval.evaluate_from_list(results=predictions, model_name=model_name)
print(evaluation_result)
```
```
{
    'accuracy': 1,
    'malformed-count': 0,
}
```
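Local models served through Ollama can be run the same way. The following is a minimal sketch that mirrors the example above but uses `OllamaRunner`, assuming it accepts the same `ExperimentArgs` as `GPTRunner` and that the model has already been pulled (e.g. `ollama pull deepseek-r1:1.5b`):

```python
from plsemanticsbench import OllamaRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset,
)

# Sketch: same PredState experiment as above, but served by a local Ollama model.
# Assumption: OllamaRunner accepts the same `args` as GPTRunner.
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name="deepseek-r1:1.5b",
    prompt_strategy=PROMPT_STRATEGY.DA,  # direct prompting, as in the example above
    num_datapoints_to_run=2,
)

ollama_runner = OllamaRunner(args=exp_args)
predictions = ollama_runner.do_experiment()

llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
print(llm_eval.evaluate_from_list(results=predictions, model_name=exp_args.model_name))
```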
Our benchmark is hosted on HuggingFace: [PLSemanticsBench](https://huggingface.co/datasets/EngineeringSoftware/PLSemanticsBench). You can load the dataset using the `datasets` library. Here is an example:
```python
from datasets import load_dataset

# Load the PredState task with standard semantics (uk), K-semantics formalization (K), and the Human-Written (human-written) dataset
predstate_IMP_K_uk_human_written = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")

# Load the PredRule task with nonstandard semantics (mk), SOS formalization (SOS), and the LLM-Translated (llm-translated) dataset
predrule_IMP_SOS_mk_llm_translated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predrule-IMP-SOS-mk-llm-translated")

# Load the PredState task with no semantics (nk) and the Fuzzer-Generated (fuzzer-generated) dataset
predstate_IMP_nk_fuzzer_generated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-nk-fuzzer-generated")
```
| Task | Split | Description |
|---|---|---|
| ✨ PredState (Final State Prediction) | `predstate-IMP-nk-{dataset-name}` | No semantics |
| | `predstate-IMP-K-uk-{dataset-name}` | Standard semantics with K-semantics formalization |
| | `predstate-IMP-K-mk-{dataset-name}` | Nonstandard semantics with K-semantics formalization |
| | `predstate-IMP-SOS-uk-{dataset-name}` | Standard semantics with SOS formalization |
| | `predstate-IMP-SOS-mk-{dataset-name}` | Nonstandard semantics with SOS formalization |
| ✨ PredRule (Semantic Rule Prediction) | `predrule-IMP-K-uk-human-written` | Standard semantics with K-semantics formalization |
| | `predrule-IMP-K-mk-human-written` | Nonstandard semantics with K-semantics formalization |
| | `predrule-IMP-SOS-uk-human-written` | Standard semantics with SOS formalization |
| | `predrule-IMP-SOS-mk-human-written` | Nonstandard semantics with SOS formalization |
| ✨ PredTrace (Execution Trace Prediction) | `predtrace-IMP-K-uk-human-written` | Standard semantics with K-semantics formalization |
| | `predtrace-IMP-K-mk-human-written` | Nonstandard semantics with K-semantics formalization |
| | `predtrace-IMP-SOS-uk-human-written` | Standard semantics with SOS formalization |
| | `predtrace-IMP-SOS-mk-human-written` | Nonstandard semantics with SOS formalization |
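The split names follow a regular pattern (`{task}-IMP-[{formalization}-]{semantics}-{dataset}`), so configurations can also be composed programmatically. A small sketch; the `config_name` helper below is not part of the library:

```python
from datasets import load_dataset


def config_name(task: str, semantics: str, dataset: str, formalization: str | None = None) -> str:
    """Compose a PLSemanticsBench configuration name, e.g. 'predstate-IMP-SOS-mk-human-written'.

    For the no-semantics setting (nk) there is no formalization component.
    """
    parts = [task, "IMP"]
    if formalization is not None:
        parts.append(formalization)
    parts += [semantics, dataset]
    return "-".join(parts)


# Loads the predtrace-IMP-K-mk-human-written configuration
ds = load_dataset(
    "EngineeringSoftware/PLSemanticsBench",
    name=config_name("predtrace", "mk", "human-written", formalization="K"),
)
```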
An example from the dataset is shown below:
```json
{
    "program": "int ans; ans = 1; ...",
    "syntax": "<program> :: ...",
    "semantics": "ℤ := Set of integers ...",
    "mutated-program": "int ans; ans = 1; ...",
    "mutation-pattern": "KeyWordSwap",
    "exec-trace": [
        {
            "linenumber": 1,
            "rule": ["Rule 38", "Rule 39"],
            "state": {"ans": 1}
        }
    ],
    "ground-truth": "<answer>...</answer>"
}
```
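A small sketch of loading one configuration and reading these fields with the `datasets` library; the split name is not assumed, and the exact set of fields may vary between configurations:

```python
from datasets import load_dataset

# Load one configuration and take whichever split it exposes,
# rather than assuming a particular split name.
ds = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")
split = next(iter(ds.values()))

example = split[0]
print(example["program"])       # the IMP program to interpret
print(example["semantics"])     # the semantics text provided to the model
print(example["ground-truth"])  # expected answer, wrapped in <answer>...</answer>
```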
```bibtex
@article{ThimmaiahETAL25PLSemanticsBench,
  title={PLSemanticsBench: Large Language Models As Programming Language Interpreters},
  author={Aditya Thimmaiah and Jiyang Zhang and Jayanth Srinivasa and Junyi Jessy Li and Milos Gligoric},
  year={2025},
  eprint={2510.03415},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2510.03415},
}
```
This project is licensed under the CC BY 4.0 License.