PLSemanticsBench is the first benchmark for evaluating LLMs as programming language interpreters. It introduces three tasks:
| Task | Description |
|---|---|
| ✨ PredState | Predicts the final program state |
| ✨ PredRule | Predicts the ordered sequence of semantic rules needed to evaluate a program |
| ✨ PredTrace | Predicts the step-by-step execution of a program |
PLSemanticsBench is hosted on HuggingFace: [PLSemanticsBench](https://huggingface.co/datasets/EngineeringSoftware/PLSemanticsBench).
To evaluate your models, implement `BaseRunner` (the `_query` method). We provide two example implementations: `GPTRunner` for OpenAI models and `OllamaRunner` for Ollama models.
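For other backends, a custom runner can subclass `BaseRunner`. The sketch below is a minimal, hypothetical example: the exact `_query` signature is defined by the package, and here we simply assume it takes a prompt string and returns the raw model response (the `MyRunner` class and `call_my_model` helper are placeholders, not part of the library).

```python
from plsemanticsbench import BaseRunner


def call_my_model(prompt: str) -> str:
    """Placeholder for your own inference client (HTTP call, local model, ...)."""
    raise NotImplementedError


class MyRunner(BaseRunner):
    """Hypothetical runner that forwards prompts to a custom model backend."""

    def _query(self, prompt: str) -> str:
        # Assumed signature: receive the rendered prompt and return the raw
        # model completion as a string; check BaseRunner for the actual one.
        return call_my_model(prompt)
```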
- Conda package management system
- Python 3.11 or higher
- OpenAI API key (for running experiments with OpenAI models)
- Create and activate the conda environment:

  ```bash
  conda env create -f env.yaml
  conda activate plsemanticsbench
  ```

- Set up your OpenAI API key (only for OpenAI models):

  ```bash
  export OPENAI_API_KEY='your-api-key-here'
  ```
We provide a quick-start bash script that:

- Sets up the `plsemanticsbench` conda environment.
- Pulls the DeepSeek-R1 1.5B model.
- Evaluates the DeepSeek-R1 1.5B model on the `PredState` task with `no-semantics` and `chain-of-thought` prompting on the Human-Written dataset.
- Prints the `accuracy` and `malformed-count` to screen.
- Creates `metrics-predstate-deepseek-r1:1.5b.json`, which contains the evaluation result.
Run it with:

```bash
bash quick
```
Here's a minimal example to get started:
```python
from plsemanticsbench import GPTRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset,
)

# Model name
model_name = "o3-mini"

# Experiment args: run the PredState task on the IMP language with
# standard semantics formalized using SOS and with direct prompting
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name=model_name,
    prompt_strategy=PROMPT_STRATEGY.DA,
    num_datapoints_to_run=2,  # Run just 2 datapoints (omit to run the entire dataset)
)

# Run inference using the OpenAI API
gpt_runner = GPTRunner(args=exp_args)

# Generation (generate LLM predictions on the PredState task)
predictions = gpt_runner.do_experiment()  # a path to dump results can be provided

# Evaluation (evaluate LLM predictions against the ground truth)
llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
evaluation_result = llm_eval.evaluate_from_list(results=predictions, model_name=model_name)
print(evaluation_result)
```
```
{
    'accuracy': 1,
    'malformed-count': 0,
}
```
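Local models served through Ollama can be run the same way. The following is a minimal sketch that mirrors the example above but uses `OllamaRunner`, assuming it accepts the same `ExperimentArgs` as `GPTRunner` and that the model has already been pulled (e.g. `ollama pull deepseek-r1:1.5b`):

```python
from plsemanticsbench import OllamaRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset,
)

# Sketch: same PredState experiment as above, but served by a local Ollama model.
# Assumption: OllamaRunner accepts the same `args` as GPTRunner.
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name="deepseek-r1:1.5b",
    prompt_strategy=PROMPT_STRATEGY.DA,  # direct prompting, as in the example above
    num_datapoints_to_run=2,
)

ollama_runner = OllamaRunner(args=exp_args)
predictions = ollama_runner.do_experiment()

llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
print(llm_eval.evaluate_from_list(results=predictions, model_name=exp_args.model_name))
```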
Our benchmark is hosted on HuggingFace: [PLSemanticsBench](https://huggingface.co/datasets/EngineeringSoftware/PLSemanticsBench). You can load the dataset using the `datasets` library. Here is an example:
```python
from datasets import load_dataset

# Load the PredState task with standard semantics (uk), K-semantics formalization (K), and the Human-Written (human-written) dataset
predstate_IMP_K_uk_human_written = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")

# Load the PredRule task with nonstandard semantics (mk), SOS formalization (SOS), and the LLM-Translated (llm-translated) dataset
predrule_IMP_SOS_mk_llm_translated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predrule-IMP-SOS-mk-llm-translated")

# Load the PredState task with no semantics (nk) and the Fuzzer-Generated (fuzzer-generated) dataset
predstate_IMP_nk_fuzzer_generated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-nk-fuzzer-generated")
```
| Task | Split | Description |
|---|---|---|
| ✨ PredState (Final State Prediction) | `predstate-IMP-nk-{dataset-name}` | No semantics |
| | `predstate-IMP-K-uk-{dataset-name}` | Standard semantics with K-semantics formalization |
| | `predstate-IMP-K-mk-{dataset-name}` | Nonstandard semantics with K-semantics formalization |
| | `predstate-IMP-SOS-uk-{dataset-name}` | Standard semantics with SOS formalization |
| | `predstate-IMP-SOS-mk-{dataset-name}` | Nonstandard semantics with SOS formalization |
| ✨ PredRule (Semantic Rule Prediction) | `predrule-IMP-K-uk-human-written` | Standard semantics with K-semantics formalization |
| | `predrule-IMP-K-mk-human-written` | Nonstandard semantics with K-semantics formalization |
| | `predrule-IMP-SOS-uk-human-written` | Standard semantics with SOS formalization |
| | `predrule-IMP-SOS-mk-human-written` | Nonstandard semantics with SOS formalization |
| ✨ PredTrace (Execution Trace Prediction) | `predtrace-IMP-K-uk-human-written` | Standard semantics with K-semantics formalization |
| | `predtrace-IMP-K-mk-human-written` | Nonstandard semantics with K-semantics formalization |
| | `predtrace-IMP-SOS-uk-human-written` | Standard semantics with SOS formalization |
| | `predtrace-IMP-SOS-mk-human-written` | Nonstandard semantics with SOS formalization |
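The split names follow a regular pattern (`{task}-IMP-[{formalization}-]{semantics}-{dataset}`), so configurations can also be composed programmatically. A small sketch; the `config_name` helper below is not part of the library:

```python
from datasets import load_dataset


def config_name(task: str, semantics: str, dataset: str, formalization: str | None = None) -> str:
    """Compose a PLSemanticsBench configuration name, e.g. 'predstate-IMP-SOS-mk-human-written'.

    For the no-semantics setting (nk) there is no formalization component.
    """
    parts = [task, "IMP"]
    if formalization is not None:
        parts.append(formalization)
    parts += [semantics, dataset]
    return "-".join(parts)


# Loads the predtrace-IMP-K-mk-human-written configuration
ds = load_dataset(
    "EngineeringSoftware/PLSemanticsBench",
    name=config_name("predtrace", "mk", "human-written", formalization="K"),
)
```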
An example from the dataset is shown below:
```json
{
    "program": "int ans; ans = 1; ...",
    "syntax": "<program> :: ...",
    "semantics": "ℤ := Set of integers ...",
    "mutated-program": "int ans; ans = 1; ...",
    "mutation-pattern": "KeyWordSwap",
    "exec-trace": [
        {
            "linenumber": 1,
            "rule": ["Rule 38", "Rule 39"],
            "state": {"ans": 1}
        }
    ],
    "ground-truth": "<answer>...</answer>"
}
```
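A small sketch of loading one configuration and reading these fields with the `datasets` library; the split name is not assumed, and the exact set of fields may vary between configurations:

```python
from datasets import load_dataset

# Load one configuration and take whichever split it exposes,
# rather than assuming a particular split name.
ds = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")
split = next(iter(ds.values()))

example = split[0]
print(example["program"])       # the IMP program to interpret
print(example["semantics"])     # the semantics text provided to the model
print(example["ground-truth"])  # expected answer, wrapped in <answer>...</answer>
```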
```bibtex
@article{ThimmaiahETAL25PLSemanticsBench,
  title={PLSemanticsBench: Large Language Models As Programming Language Interpreters},
  author={Aditya Thimmaiah and Jiyang Zhang and Jayanth Srinivasa and Junyi Jessy Li and Milos Gligoric},
  year={2025},
  eprint={2510.03415},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2510.03415},
}
```
This project is licensed under the CC BY 4.0 License.