ACL 2026 Findings | Official Repository
State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap—where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code).
We propose SolidCoder with a simple principle: don't imagine—execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.
Our S.O.L.I.D. architecture integrates five synergistic components:
| Component | Description |
|---|---|
| Shift-left Planning | Identifies edge cases (empty inputs, boundary values, corner cases) before formulating the algorithmic plan, forcing robust algorithm design from the outset |
| Oracle-based Assertions | Generates property-based tests without ground-truth outputs by verifying domain-invariant properties (e.g., output length, permutation constraints) rather than exact values |
| Live Execution | Runs generated code in a sandboxed environment (5-second timeout) to provide concrete runtime feedback, eliminating hallucinated execution traces |
| Intermediate Simulation | Traces code step-by-step on sample inputs immediately after generation to catch plan-to-code translation errors (off-by-one, operator precedence) before live execution |
| Defensive Accumulation | Maintains a persistent test suite that grows throughout debugging; all accumulated tests are re-executed after every code modification to prevent regression |
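As an illustration of the Oracle-based Assertions idea (a minimal sketch, not the repository's actual implementation), a property-based oracle for a sorting task can verify domain invariants without knowing the expected output. The names `candidate_sort` and `oracle_check` here are hypothetical:

```python
def candidate_sort(xs):
    """Candidate solution under test (hypothetical example)."""
    return sorted(xs)

def oracle_check(func, inputs):
    """Property-based oracle: instead of comparing against ground-truth
    outputs, verify invariants that any correct solution must satisfy:
    length preserved, output is a permutation of input, output is sorted."""
    for xs in inputs:
        out = func(list(xs))
        assert len(out) == len(xs), "length must be preserved"
        assert sorted(out) == sorted(xs), "output must be a permutation of input"
        assert all(a <= b for a, b in zip(out, out[1:])), "output must be non-decreasing"
    return True

# Edge cases first, in the Shift-left spirit: empty input, singleton, duplicates
print(oracle_check(candidate_sort, [[], [1], [3, 1, 2], [2, 2, 1]]))  # True
```

Because the oracle checks properties rather than exact values, the same assertions apply to any generated input, which is what lets the test suite grow during debugging without hand-labeled outputs.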
SolidCoder consistently outperforms the previous state-of-the-art (CodeSIM) across all models and benchmarks. Results show pass@1 accuracy (%):
| Model | Benchmark | Direct | CoT | Self-Plan | Analogical | MapCoder | CodeSIM | SolidCoder |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | HumanEval | 90.2 | 90.9 | 89.0 | 88.4 | 90.2 | 95.1 | 95.7 |
| | CodeContests | 42.4 | 44.2 | 49.1 | 30.3 | 69.1 | 72.7 | 77.0 |
| | APPS | 10.7 | 17.3 | 14.7 | 14.0 | 20.7 | 23.3 | 26.7 |
| GPT-OSS-120B | HumanEval | 69.5 | 96.3 | 90.8 | 89.0 | 61.0 | 98.2 | 98.2 |
| | CodeContests | 75.8 | 75.2 | 75.2 | 75.2 | 44.8 | 87.9 | 92.1 |
| | APPS | 35.3 | 32.7 | 34.7 | 30.7 | 24.0 | 39.3 | 40.7 |
| Grok-4.1-Fast | HumanEval | 88.4 | 96.9 | 96.9 | 95.7 | 95.7 | 97.6 | 97.6 |
| | CodeContests | 79.4 | 85.4 | 81.2 | 77.0 | 83.6 | 95.2 | 98.2 |
| | APPS | 37.3 | 34.7 | 37.3 | 36.0 | 33.3 | 41.3 | 42.0 |
| Benchmark | Avg. CodeSIM | Avg. SolidCoder | Improvement |
|---|---|---|---|
| HumanEval | 97.0% | 97.2% | +0.2%p |
| CodeContests | 85.3% | 89.1% | +3.8%p |
| APPS | 34.6% | 36.5% | +1.9%p |
The largest gains appear on CodeContests—where mental simulation begins to fail but problems remain tractable for execution-grounded verification. Notably, Grok-4.1-Fast with SolidCoder reaches 98.2% on CodeContests, approaching ceiling performance.
Figure 1: Comparison between CodeSIM (mental simulation) and SolidCoder (concrete execution). Gray boxes are shared with CodeSIM; blue boxes are S.O.L.I.D. components introduced by SolidCoder.
The core insight of SolidCoder is that LLMs hallucinate execution traces during mental simulation:
| Approach | Verification Method | Consequence |
|---|---|---|
| CodeSIM | Mental simulation ("imagines") | Hallucinates correct behavior for buggy code |
| SolidCoder | Live execution ("executes") | Grounds verification in runtime reality |
Figure 2: Concrete example demonstrating the Mental-Reality Gap. Left (CodeSIM): Mental simulation hallucinates correct behavior for buggy code, incorrectly validating a flawed solution. Right (SolidCoder): Live execution catches the bug through concrete runtime feedback, enabling proper debugging and test accumulation. This illustrates how execution-grounded verification detects errors that mental simulation misses.
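To make the gap concrete in code (a hypothetical illustration in the spirit of Figure 2, not the figure's actual example), consider an off-by-one bug that a hallucinated mental trace can easily "verify" as correct, while a single concrete execution exposes it:

```python
# Hypothetical illustration of the Mental-Reality Gap.
def sum_to_n(n):
    """Intended: return 1 + 2 + ... + n. BUG: range() excludes n."""
    return sum(range(1, n))  # should be range(1, n + 1)

def live_check(func):
    """Concrete execution on sample inputs: collect (input, expected, got)
    for every failing case instead of trusting an imagined trace."""
    failures = []
    for n, expected in [(1, 1), (3, 6), (10, 55)]:
        got = func(n)
        if got != expected:
            failures.append((n, expected, got))
    return failures

print(live_check(sum_to_n))  # [(1, 1, 0), (3, 6, 3), (10, 55, 45)]
```

A mental simulation that "rounds" the loop bound to the intended behavior validates this function; running it returns three concrete counterexamples that can then seed the defensive test suite.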
- Python 3.10+
- Docker (for ExecEval sandbox, required for APPS & CodeContests)
```bash
# Clone repository
git clone https://github.com/Anonymous-ARR-Submissions/SolidCoder.git
cd SolidCoder

# Create virtual environment
python -m venv venv
source venv/bin/activate    # Linux/Mac
# or: venv\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
```

```bash
# Pull and run ExecEval container
docker pull ntunlp/execeval
docker run -d -p 5000:5000 --name execeval ntunlp/execeval

# Verify
curl http://127.0.0.1:5000/api/all_runtimes
```

```bash
# Set API key
export OPENROUTER_API_KEY='your-api-key'

# Run SolidCoder with all S.O.L.I.D. components
PYTHONPATH=./src python src/main.py \
  --dataset HumanEval \
  --strategy SolidCoder \
  --model openai/gpt-4o-2024-08-06 \
  --model_provider OpenRouter \
  --enable_shift_left \
  --enable_oracle_assert \
  --enable_live_verify \
  --enable_inter_sim \
  --enable_defensive_test \
  --temperature 0 \
  --verbose 1
```

Each component can be individually enabled or disabled for ablation studies:
| Flag | Component | Description |
|---|---|---|
| `--enable_shift_left` | [S] | Edge-case identification before planning |
| `--enable_oracle_assert` | [O] | Property-based test generation |
| `--enable_live_verify` | [L] | Sandboxed code execution |
| `--enable_inter_sim` | [I] | Intermediate code simulation |
| `--enable_defensive_test` | [D] | Failing test accumulation |
| Option | Values | Description |
|---|---|---|
| `--dataset` | `HumanEval`, `CC`, `APPS` | Benchmark dataset |
| `--strategy` | `Direct`, `CoT`, `SelfPlanning`, `Analogical`, `MapCoder`, `CodeSIM`, `SolidCoder` | Prompting strategy |
| `--model` | `openai/gpt-4o-2024-08-06`, `openai/gpt-oss-120b`, `x-ai/grok-4.1-fast` | Model name (OpenRouter format) |
| `--model_provider` | `OpenAI`, `OpenRouter`, `Anthropic`, `Gemini`, `vLLM` | API provider |
| `--temperature` | `0` (default) | Sampling temperature |
| `--verbose` | `0`, `1`, `2` | Logging verbosity (2 = full trace) |
| `--store_log_in_file` | `yes`, `no` | Save detailed logs |
| `--cont` | `yes`, `no` | Resume from existing results |
```bash
# OpenRouter (recommended - supports all models)
export OPENROUTER_API_KEY='your-api-key'

# OpenAI
export OPENAI_API_KEY='your-api-key'

# Anthropic
export ANTHROPIC_API_KEY='your-api-key'
```

| Dataset | Problems | Difficulty | ExecEval Required |
|---|---|---|---|
| HumanEval | 164 | Easy | No |
| CodeContests (CC) | 165 | Medium | Yes |
| APPS | 150 | Hard | Yes |
| Model | Description | OpenRouter ID |
|---|---|---|
| GPT-4o | Primary baseline (comparison with CodeSIM) | openai/gpt-4o-2024-08-06 |
| GPT-OSS-120B | Open-source GPT, RL post-trained | openai/gpt-oss-120b |
| Grok-4.1-Fast | Non-GPT frontier model, RL post-trained | x-ai/grok-4.1-fast |
```bash
# Run all strategies for a given model and dataset
for strategy in Direct CoT SelfPlanning Analogical MapCoder CodeSIM SolidCoder; do
  PYTHONPATH=./src python src/main.py \
    --dataset CC \
    --strategy $strategy \
    --model openai/gpt-4o-2024-08-06 \
    --model_provider OpenRouter \
    --temperature 0 \
    --verbose 1 \
    --store_log_in_file yes \
    $([ "$strategy" == "SolidCoder" ] && echo "--enable_shift_left --enable_oracle_assert --enable_live_verify --enable_inter_sim --enable_defensive_test")
done
```

```bash
# Full SolidCoder (baseline)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_live_verify --enable_inter_sim --enable_defensive_test

# w/o Shift-left (-S)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_oracle_assert --enable_live_verify --enable_inter_sim --enable_defensive_test

# w/o Oracle (-O)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_live_verify --enable_inter_sim --enable_defensive_test

# w/o Live Execution (-L)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_inter_sim --enable_defensive_test

# w/o Intermediate Simulation (-I)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_live_verify --enable_defensive_test

# w/o Defensive Accumulation (-D)
PYTHONPATH=./src python src/main.py \
  --dataset CC --strategy SolidCoder --model openai/gpt-4o-2024-08-06 --model_provider OpenRouter \
  --enable_shift_left --enable_oracle_assert --enable_live_verify --enable_inter_sim
```

```
SolidCoder/
├── src/
│   ├── main.py                  # Entry point
│   ├── promptings/
│   │   ├── SolidCoder.py        # S.O.L.I.D. implementation
│   │   ├── CodeSIM.py           # Baseline reimplementation
│   │   ├── MapCoder.py          # Baseline
│   │   ├── SelfPlanning.py      # Baseline
│   │   ├── Analogical.py        # Baseline
│   │   ├── CoT.py               # Baseline
│   │   └── Direct.py            # Baseline
│   ├── models/                  # LLM providers
│   │   ├── OpenRouterModel.py   # OpenRouter API
│   │   ├── OpenAI.py            # OpenAI API
│   │   ├── Anthropic.py         # Claude API
│   │   ├── Gemini.py            # Google API
│   │   └── VLLMModel.py         # Local vLLM
│   ├── datasets/                # Dataset loaders
│   │   ├── HumanEvalDataset.py
│   │   ├── CodeContestDataset.py
│   │   └── APPSDataset.py
│   └── evaluations/             # Code execution & evaluation
├── data/
│   ├── HumanEval/HumanEval.jsonl
│   ├── CodeContest/Test.jsonl
│   └── APPS/selected150.jsonl
├── results/                     # Experiment outputs
└── scripts/
    └── run_experiments.sh       # Batch experiment runner
```
Results are saved in:

```
results/{Dataset}/{Strategy}/{Model}/Python3-{temp}-0.95-1/Run-{n}/
├── Results.jsonl   # Per-problem results with generated code
├── Log.txt         # Execution log with prompts and responses
└── Summary.txt     # Statistics summary
```

Check accuracy:

```bash
# View summary statistics
cat results/CC/SolidCoder/gpt-4o/*/Run-1/Summary.txt

# Get final accuracy from log
grep "number of success" results/CC/SolidCoder/gpt-4o/*/Run-1/Log.txt | tail -1
# Output: completed 165/165, Solved: True, number of success = 127/165, acc = 76.97
```

SolidCoder implements safety measures for live code execution:
| Mechanism | Implementation | Purpose |
|---|---|---|
| Timeout | 5-second limit | Prevent infinite loops |
| Input Blocking | `input()` → `RuntimeError` | Prevent hanging on stdin |
| Isolated Namespace | Fresh `exec_globals` per run | Prevent state leakage |
| ExecEval Docker | Sandboxed container | Secure execution for competition problems |
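The namespace-isolation and input-blocking rows can be sketched in a few lines of Python (a simplified illustration under stated assumptions, not the repository's actual evaluator; timeout enforcement is elided here, since in practice it requires a subprocess or signal-based mechanism):

```python
import builtins

def run_sandboxed(code: str):
    """Execute untrusted code with two of the table's safety measures:
    - Input Blocking: input() raises RuntimeError instead of hanging on stdin.
    - Isolated Namespace: a fresh exec_globals dict per run, so no state
      leaks between evaluations.
    Returns the namespace so the caller can inspect results."""
    def blocked_input(*args, **kwargs):
        raise RuntimeError("input() is blocked in the sandbox")

    exec_globals = {"__builtins__": {**builtins.__dict__, "input": blocked_input}}
    exec(code, exec_globals)
    return exec_globals

ns = run_sandboxed("result = sum(range(10))")
print(ns["result"])  # 45
```

In-process `exec` isolation like this is only a convenience for trusted benchmark code; for the competition datasets the repository delegates to the ExecEval Docker container for genuinely sandboxed execution.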
The Exemplar, Plan, and Edge Aware columns describe generation capabilities; Mental Sim., Live Exec., Debug, and Defensive describe verification capabilities:

| Approach | Exemplar | Plan | Edge Aware | Mental Sim. | Live Exec. | Debug | Defensive |
|---|---|---|---|---|---|---|---|
| Reflexion | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| Self-Planning | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Analogical | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| LATS | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| MapCoder | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| CodeSIM | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| SolidCoder | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
SolidCoder uniquely combines all capabilities: exemplar-based planning with edge-case awareness, mental simulation augmented by live execution, and defensive test accumulation.
```bibtex
@inproceedings{lee2026solidcoder,
  title={SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution},
  author={Woojin Lee and Jin-Xia Huang},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  year={2026}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
This work builds upon the CodeSIM framework. We thank the authors for their foundational contributions to multi-agent code generation.

