ACL 2026 Findings | Official Repository
State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap—where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code).
We propose SolidCoder with a simple principle: don't imagine—execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.
Our S.O.L.I.D. architecture integrates five synergistic components:
| Component | Description |
|---|---|
| Shift-left Planning | Identifies edge cases (empty inputs, boundary values, corner cases) before formulating the algorithmic plan, forcing robust algorithm design from the outset |
| Oracle-based Assertions | Generates property-based tests without ground-truth outputs by verifying domain-invariant properties (e.g., output length, permutation constraints) rather than exact values |
| Live Execution | Runs generated code in a sandboxed environment (5-second timeout) to provide concrete runtime feedback, eliminating hallucinated execution traces |
| Intermediate Simulation | Traces code step-by-step on sample inputs immediately after generation to catch plan-to-code translation errors (off-by-one, operator precedence) before live execution |
| Defensive Accumulation | Maintains a persistent test suite that grows throughout debugging; all accumulated tests are re-executed after every code modification to prevent regression |
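As an illustration of the Oracle-based Assertions idea (a minimal sketch, not the repository's actual implementation), a property-based oracle for a sorting task can verify domain invariants without knowing the expected output. The names `candidate_sort` and `oracle_check` here are hypothetical:

```python
def candidate_sort(xs):
    """Candidate solution under test (hypothetical example)."""
    return sorted(xs)

def oracle_check(func, inputs):
    """Property-based oracle: instead of comparing against ground-truth
    outputs, verify invariants that any correct solution must satisfy:
    length preserved, output is a permutation of input, output is sorted."""
    for xs in inputs:
        out = func(list(xs))
        assert len(out) == len(xs), "length must be preserved"
        assert sorted(out) == sorted(xs), "output must be a permutation of input"
        assert all(a <= b for a, b in zip(out, out[1:])), "output must be non-decreasing"
    return True

# Edge cases first, in the Shift-left spirit: empty input, singleton, duplicates
print(oracle_check(candidate_sort, [[], [1], [3, 1, 2], [2, 2, 1]]))  # True
```

Because the oracle checks properties rather than exact values, the same assertions apply to any generated input, which is what lets the test suite grow during debugging without hand-labeled outputs.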
SolidCoder consistently outperforms the previous state-of-the-art (CodeSIM) across all models and benchmarks. Results show pass@1 accuracy (%):
| Model | Benchmark | Direct | CoT | Self-Plan | Analogical | MapCoder | CodeSIM | SolidCoder |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | HumanEval | 90.2 | 90.9 | 89.0 | 88.4 | 90.2 | 95.1 | 95.7 |
| | CodeContests | 42.4 | 44.2 | 49.1 | 30.3 | 69.1 | 72.7 | 77.0 |
| | APPS | 10.7 | 17.3 | 14.7 | 14.0 | 20.7 | 23.3 | 26.7 |
| GPT-OSS-120B | HumanEval | 69.5 | 96.3 | 90.8 | 89.0 | 61.0 | 98.2 | 98.2 |
| | CodeContests | 75.8 | 75.2 | 75.2 | 75.2 | 44.8 | 87.9 | 92.1 |
| | APPS | 35.3 | 32.7 | 34.7 | 30.7 | 24.0 | 39.3 | 40.7 |
| Grok-4.1-Fast | HumanEval | 88.4 | 96.9 | 96.9 | 95.7 | 95.7 | 97.6 | 97.6 |
| | CodeContests | 79.4 | 85.4 | 81.2 | 77.0 | 83.6 | 95.2 | 98.2 |
| | APPS | 37.3 | 34.7 | 37.3 | 36.0 | 33.3 | 41.3 | 42.0 |
| Benchmark | Avg. CodeSIM | Avg. SolidCoder | Improvement |
|---|---|---|---|
| HumanEval | 97.0% | 97.2% | +0.2%p |
| CodeContests | 85.3% | 89.1% | +3.8%p |
| APPS | 34.6% | 36.5% | +1.9%p |
The largest gains appear on CodeContests—where mental simulation begins to fail but problems remain tractable for execution-grounded verification. Notably, Grok-4.1-Fast with SolidCoder reaches 98.2% on CodeContests, approaching ceiling performance.
Figure 1: Comparison between CodeSIM (mental simulation) and SolidCoder (concrete execution). Gray boxes are shared with CodeSIM; blue boxes are S.O.L.I.D. components introduced by SolidCoder.
The core insight of SolidCoder is that LLMs hallucinate execution traces during mental simulation:
| Approach | Verification Method | Consequence |
|---|---|---|
| CodeSIM | Mental simulation ("imagines") | Hallucinates correct behavior for buggy code |
| SolidCoder | Live execution ("executes") | Grounds verification in runtime reality |
Figure 2: Concrete example demonstrating the Mental-Reality Gap. Left (CodeSIM): Mental simulation hallucinates correct behavior for buggy code, incorrectly validating a flawed solution. Right (SolidCoder): Live execution catches the bug through concrete runtime feedback, enabling proper debugging and test accumulation. This illustrates how execution-grounded verification detects errors that mental simulation misses.
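To make the gap concrete in code (a hypothetical illustration in the spirit of Figure 2, not the figure's actual example), consider an off-by-one bug that a hallucinated mental trace can easily "verify" as correct, while a single concrete execution exposes it:

```python
# Hypothetical illustration of the Mental-Reality Gap.
def sum_to_n(n):
    """Intended: return 1 + 2 + ... + n. BUG: range() excludes n."""
    return sum(range(1, n))  # should be range(1, n + 1)

def live_check(func):
    """Concrete execution on sample inputs: collect (input, expected, got)
    for every failing case instead of trusting an imagined trace."""
    failures = []
    for n, expected in [(1, 1), (3, 6), (10, 55)]:
        got = func(n)
        if got != expected:
            failures.append((n, expected, got))
    return failures

print(live_check(sum_to_n))  # [(1, 1, 0), (3, 6, 3), (10, 55, 45)]
```

A mental simulation that "rounds" the loop bound to the intended behavior validates this function; running it returns three concrete counterexamples that can then seed the defensive test suite.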
- Python 3.10+
- Docker (for ExecEval sandbox, required for APPS & CodeContests)
```bash
# Clone repository
git clone https://github.com/Anonymous-ARR-Submissions/SolidCoder.git
cd SolidCoder

# Create virtual environment
python -m venv venv
source venv/bin/activate    # Linux/Mac
# or: venv\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
```

```bash
# Pull and run ExecEval container
docker pull ntunlp/execeval
docker run -d -p 5000:5000 --name execeval ntunlp/execeval

# Verify
curl http://127.0.0.1:5000/api/all_runtimes
```

```bash
# Set API key
export OPENROUTER_API_KEY='your-api-key'

# Run SolidCoder with all S.O.L.I.D. components
PYTHONPATH=./src python src/main.py \
  --dataset HumanEval \
  --strategy SolidCoder \
  --model openai/gpt-4o-2024-08-06 \
  --model_provider OpenRouter \
  --enable_shift_left \
  --enable_oracle_assert \
  --enable_live_verify \
  --enable_inter_sim \
  --enable_defensive_test \
  --temperature 0 \
  --verbose 1
```

Each component can be individually enabled or disabled for ablation studies:
| Flag | Component | Description |
|---|---|---|
| `--enable_shift_left` | [S] | Edge-case identification before planning |
| `--enable_oracle_assert` | [O] | Property-based test generation |
| `--enable_live_verify` | [L] | Sandboxed code execution |
| `--enable_inter_sim` | [I] | Intermediate code simulation |
| `--enable_defensive_test` | [D] | Failing test accumulation |
| Option | Values | Description |
|---|---|---|
| `--dataset` | `HumanEval`, `CC`, `APPS` | Benchmark dataset |
| `--strategy` | `Direct`, `CoT`, `SelfPlanning`, `Analogical`, `MapCoder`, `CodeSIM`, `SolidCoder` | Prompting strategy |
| `--model` | `openai/gpt-4o-2024-08-06`, `openai/gpt-oss-120b`, `x-ai/grok-4.1-fast` | Model name (OpenRouter format) |
| `--model_provider` | `OpenAI`, `OpenRouter`, `Anthropic`, `Gemini`, `vLLM` | API provider |
| `--temperature` | `0` (default) | Sampling temperature |
| `--verbose` | `0`, `1`, `2` | Logging verbosity (2 = full trace) |
| `--store_log_in_file` | `yes`, `no` | Save detailed logs |
| `--cont` | `yes`, `no` | Resume from existing results |
```bash
# OpenRouter (recommended - supports all models)
export OPENROUTER_API_KEY='your-api-key'

# OpenAI
export OPENAI_API_KEY='your-api-key'

# Anthropic
export ANTHROPIC_API_KEY='your-api-key'
```

| Dataset | Problems | Difficulty | ExecEval Required |
|---|---|---|---|
| HumanEval | 164 | Easy | No |
| CodeContests (CC) | 165 | Medium | Yes |
| APPS | 150 | Hard | Yes |
| Model | Description | OpenRouter ID |
|---|---|---|
| GPT-4o | Primary baseline (comparison with CodeSIM) | openai/gpt-4o-2024-08-06 |
| GPT-OSS-120B | Open-source GPT, RL post-trained | openai/gpt-oss-120b |
| Grok-4.1-Fast | Non-GPT frontier model, RL post-trained | x-ai/grok-4.1-fast |
```bash
# Run all strategies for a given model and dataset
for strategy in Direct CoT SelfPlanning Analogical MapCoder CodeSIM SolidCoder; do
  PYTHONPATH=./src python src/main.py \
    --dataset CC \
    --strategy $strategy \
    --model openai/gpt-4o-2024-08-06 \
    --model_provider OpenRouter \
    --temperature 0 \
    --verbose 1 \
    --store_log_in_file yes \
    $([ "$strategy" == "SolidCoder" ] && echo "--enable_shift_left --enable_oracle_assert --enable_live_verify --enable_inter_sim --enable_defensive_test")
done
```

```bash
# Full SolidCoder (baseline)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_live_verify --enable_inter_sim --enable_defensive_test

# w/o Shift-left (-S)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_oracle_assert --enable_live_verify --enable_inter_sim --enable_defensive_test

# w/o Oracle (-O)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_live_verify --enable_inter_sim --enable_defensive_test

# w/o Live Execution (-L)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_inter_sim --enable_defensive_test

# w/o Intermediate Simulation (-I)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_live_verify --enable_defensive_test

# w/o Defensive Accumulation (-D)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_live_verify --enable_inter_sim
```

```
SolidCoder/
├── src/
│   ├── main.py                  # Entry point
│   ├── promptings/
│   │   ├── SolidCoder.py        # S.O.L.I.D. implementation
│   │   ├── CodeSIM.py           # Baseline reimplementation
│   │   ├── MapCoder.py          # Baseline
│   │   ├── SelfPlanning.py      # Baseline
│   │   ├── Analogical.py        # Baseline
│   │   ├── CoT.py               # Baseline
│   │   └── Direct.py            # Baseline
│   ├── models/                  # LLM providers
│   │   ├── OpenRouterModel.py   # OpenRouter API
│   │   ├── OpenAI.py            # OpenAI API
│   │   ├── Anthropic.py         # Claude API
│   │   ├── Gemini.py            # Google API
│   │   └── VLLMModel.py         # Local vLLM
│   ├── datasets/                # Dataset loaders
│   │   ├── HumanEvalDataset.py
│   │   ├── CodeContestDataset.py
│   │   └── APPSDataset.py
│   └── evaluations/             # Code execution & evaluation
├── data/
│   ├── HumanEval/HumanEval.jsonl
│   ├── CodeContest/Test.jsonl
│   └── APPS/selected150.jsonl
├── results/                     # Experiment outputs
└── scripts/
    └── run_experiments.sh       # Batch experiment runner
```
Results are saved in:

```
results/{Dataset}/{Strategy}/{Model}/Python3-{temp}-0.95-1/Run-{n}/
├── Results.jsonl   # Per-problem results with generated code
├── Log.txt         # Execution log with prompts and responses
└── Summary.txt     # Statistics summary
```

Check accuracy:

```bash
# View summary statistics
cat results/CC/SolidCoder/gpt-4o/*/Run-1/Summary.txt

# Get final accuracy from log
grep "number of success" results/CC/SolidCoder/gpt-4o/*/Run-1/Log.txt | tail -1
# Output: completed 165/165, Solved: True, number of success = 127/165, acc = 76.97
```

SolidCoder implements safety measures for live code execution:
| Mechanism | Implementation | Purpose |
|---|---|---|
| Timeout | 5-second limit | Prevent infinite loops |
| Input Blocking | `input()` → `RuntimeError` | Prevent hanging on stdin |
| Isolated Namespace | Fresh `exec_globals` per run | Prevent state leakage |
| ExecEval Docker | Sandboxed container | Secure execution for competition problems |
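The namespace-isolation and input-blocking rows can be sketched in a few lines of Python (a simplified illustration under stated assumptions, not the repository's actual evaluator; timeout enforcement is elided here, since in practice it requires a subprocess or signal-based mechanism):

```python
import builtins

def run_sandboxed(code: str):
    """Execute untrusted code with two of the table's safety measures:
    - Input Blocking: input() raises RuntimeError instead of hanging on stdin.
    - Isolated Namespace: a fresh exec_globals dict per run, so no state
      leaks between evaluations.
    Returns the namespace so the caller can inspect results."""
    def blocked_input(*args, **kwargs):
        raise RuntimeError("input() is blocked in the sandbox")

    exec_globals = {"__builtins__": {**builtins.__dict__, "input": blocked_input}}
    exec(code, exec_globals)
    return exec_globals

ns = run_sandboxed("result = sum(range(10))")
print(ns["result"])  # 45
```

In-process `exec` isolation like this is only a convenience for trusted benchmark code; for the competition datasets the repository delegates to the ExecEval Docker container for genuinely sandboxed execution.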
The Exemplar, Plan, and Edge Aware columns describe generation capabilities; Mental Sim., Live Exec., Debug, and Defensive describe verification capabilities:

| Approach | Exemplar | Plan | Edge Aware | Mental Sim. | Live Exec. | Debug | Defensive |
|---|---|---|---|---|---|---|---|
| Reflexion | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| Self-Planning | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Analogical | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| LATS | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| MapCoder | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| CodeSIM | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| SolidCoder | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
SolidCoder uniquely combines all capabilities: exemplar-based planning with edge-case awareness, mental simulation augmented by live execution, and defensive test accumulation.
```bibtex
@inproceedings{lee2026solidcoder,
  title={SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution},
  author={Woojin Lee and Jin-Xia Huang},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  year={2026}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
This work builds upon the CodeSIM framework. We thank the authors for their foundational contributions to multi-agent code generation.

