PLSemanticsBench



About

PLSemanticsBench is the first benchmark for evaluating LLMs as programming language interpreters. It comprises three tasks:

Task        Description
PredState   Predicts the final program state
PredRule    Predicts the ordered sequence of semantic rules needed to evaluate a program
PredTrace   Predicts the step-by-step execution of a program

PLSemanticsBench is hosted on Hugging Face as EngineeringSoftware/PLSemanticsBench.

To evaluate your own models, implement a subclass of BaseRunner (specifically its _query method). We provide two example implementations: GPTRunner for OpenAI models and OllamaRunner for Ollama models.
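
As a rough illustration, a custom runner might look like the sketch below. It assumes that BaseRunner is importable from the top-level package like the runners above and that _query receives the fully rendered prompt as a string and returns the model's raw text; check BaseRunner in this repository for the exact signature. The local HTTP endpoint is purely hypothetical.

import requests

from plsemanticsbench import BaseRunner

class MyRunner(BaseRunner):
    # Hypothetical runner that forwards each prompt to a local model server.
    def _query(self, prompt):
        # Assumption: _query takes the rendered prompt and returns the raw model response text.
        response = requests.post(
            "http://localhost:8000/generate",  # hypothetical endpoint
            json={"prompt": prompt},
            timeout=600,
        )
        return response.json()["text"]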

Installation

System Requirements

  • Conda package management system
  • Python 3.11 or higher
  • OpenAI API key (for running experiments with OpenAI models)

Step-by-Step Installation

  1. Create and activate the conda environment:
conda env create -f env.yaml
conda activate plsemanticsbench
  2. Set up your OpenAI API key (only for OpenAI models):
export OPENAI_API_KEY='your-api-key-here'
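
To sanity-check the setup, a short Python snippet can confirm that the package imports inside the new environment and that the API key is visible (a minimal sketch):

import os

import plsemanticsbench  # should import cleanly inside the plsemanticsbench environment

# The API key is only required when running OpenAI models.
if not os.environ.get("OPENAI_API_KEY"):
    print("Warning: OPENAI_API_KEY is not set; OpenAI runners will fail.")
else:
    print("Setup looks good.")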

Quick Start

We provide a bash script quick that:

  1. Sets up the plsemanticsbench conda environment.
  2. Pulls the DeepSeek-R1 1.5B model.
  3. Evaluates the DeepSeek-R1 1.5B model on the PredState task with no-semantics and chain-of-thought prompting on the Human-Written dataset.
  4. Prints the accuracy and malformed-count to screen.
  5. Creates metrics-predstate-deepseek-r1:1.5b.json that contains the evaluation result.
bash quick
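
After the script finishes, the metrics file can be inspected programmatically. The sketch below assumes the file uses the same keys as the Expected Output shown later ('accuracy' and 'malformed-count'):

import json

with open("metrics-predstate-deepseek-r1:1.5b.json") as f:
    metrics = json.load(f)

print("accuracy:", metrics["accuracy"])
print("malformed-count:", metrics["malformed-count"])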

Detailed Usage

Basic Example

Here's a minimal example to get started:

from plsemanticsbench import GPTRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset
)

# Model name
model_name = "o3-mini"

# Experiment args: Run the PredState task on the IMP language with
# standard semantics formalized using SOS and with direct prompting
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name=model_name,
    prompt_strategy=PROMPT_STRATEGY.DA,
    num_datapoints_to_run=2, # Run just 2 datapoints (omit to run entire dataset)
)
                        
# Run inference using the OpenAI API
gpt_runner = GPTRunner(args=exp_args)

# Generation (generate LLM prediction on the predstate task)
predictions = gpt_runner.do_experiment() # path to dump results can be provided

# Evaluation (evaluate LLM prediction against ground-truth)
llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
evaluation_result = llm_eval.evaluate_from_list(results=predictions, model_name=model_name)
print(evaluation_result)

Expected Output

{
    'accuracy': 1,
    'malformed-count': 0,
}

Benchmark

Our benchmark is hosted on Hugging Face as EngineeringSoftware/PLSemanticsBench.

Benchmark Access

You can load the dataset using the datasets library. Here is an example:

from datasets import load_dataset

# Load the PredState task with standard semantics (uk), K-semantics formalization (K), and the Human-Written (human-written) dataset
predstate_IMP_K_uk_human_written = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")

# Load the PredRule task with nonstandard semantics (mk), SOS formalization (SOS), and the LLM-Translated (llm-translated) dataset
predrule_IMP_SOS_mk_llm_translated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predrule-IMP-SOS-mk-llm-translated")

# Load the PredState task with no semantics (nk) and the Fuzzer-Generated (fuzzer-generated) dataset
predstate_IMP_nk_fuzzer_generated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-nk-fuzzer-generated")

Dataset Split

PredState (Final State Prediction)
  predstate-IMP-nk-{dataset-name}       No semantics
  predstate-IMP-K-uk-{dataset-name}     Standard semantics with K-semantics formalization
  predstate-IMP-K-mk-{dataset-name}     Nonstandard semantics with K-semantics formalization
  predstate-IMP-SOS-uk-{dataset-name}   Standard semantics with SOS formalization
  predstate-IMP-SOS-mk-{dataset-name}   Nonstandard semantics with SOS formalization

PredRule (Semantic Rule Prediction)
  predrule-IMP-K-uk-human-written       Standard semantics with K-semantics formalization
  predrule-IMP-K-mk-human-written       Nonstandard semantics with K-semantics formalization
  predrule-IMP-SOS-uk-human-written     Standard semantics with SOS formalization
  predrule-IMP-SOS-mk-human-written     Nonstandard semantics with SOS formalization

PredTrace (Execution Trace Prediction)
  predtrace-IMP-K-uk-human-written      Standard semantics with K-semantics formalization
  predtrace-IMP-K-mk-human-written      Nonstandard semantics with K-semantics formalization
  predtrace-IMP-SOS-uk-human-written    Standard semantics with SOS formalization
  predtrace-IMP-SOS-mk-human-written    Nonstandard semantics with SOS formalization
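
The split names follow a regular pattern, so a configuration name can be assembled from its components (a small sketch based on the naming convention above; note that PredRule and PredTrace are only available for the human-written dataset):

def split_name(task, formalization, semantics, dataset):
    """Assemble a configuration name such as 'predstate-IMP-K-uk-human-written'."""
    # task: "predstate", "predrule", or "predtrace"
    # formalization: "K", "SOS", or None (the no-semantics splits have no formalization)
    # semantics: "uk" (standard), "mk" (nonstandard), or "nk" (no semantics)
    # dataset: "human-written", "llm-translated", or "fuzzer-generated"
    parts = [task, "IMP"]
    if formalization:
        parts.append(formalization)
    parts += [semantics, dataset]
    return "-".join(parts)

print(split_name("predstate", None, "nk", "fuzzer-generated"))  # predstate-IMP-nk-fuzzer-generated
print(split_name("predrule", "SOS", "mk", "human-written"))     # predrule-IMP-SOS-mk-human-written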

Data Example

An example datapoint from the dataset looks as follows:

{
  "program": "int ans; ans = 1; ...",
  "syntax": "<program> :: ...",
  "semantics": "ℤ := Set of integers ...",
  "mutated-program": "int ans; ans = 1; ...",
  "mutation-pattern": "KeyWordSwap",
  "exec-trace": [
    {
      "linenumber": 1,
      "rule": ["Rule 38", "Rule 39"],
      "state": {"ans": 1}
    }
  ],
  "ground-truth": "<answer>...</answer>"
}
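
Once a configuration is loaded, these fields can be accessed directly. This is a minimal sketch; the split key exposed by each configuration may vary, so it is selected dynamically here:

from datasets import load_dataset

ds = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")
split = ds[next(iter(ds))]  # pick whichever split the configuration exposes
example = split[0]

print(example["program"])       # the IMP program to interpret
print(example["ground-truth"])  # expected answer in <answer>...</answer> form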

Citation

@article{ThimmaiahETAL25PLSemanticsBench,
  title={PLSemanticsBench: Large Language Models As Programming Language Interpreters},
  author={Aditya Thimmaiah and Jiyang Zhang and Jayanth Srinivasa and Junyi Jessy Li and Milos Gligoric},
  year={2025},
  eprint={2510.03415},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2510.03415},
}

License

This project is licensed under the CC BY 4.0 License.
