| title | Code Review Agent (HF Space) |
|---|---|
| emoji | 🧠 |
| colorFrom | blue |
| colorTo | green |
| sdk | docker |
| app_port | 7860 |
| pinned | false |
Lines above: Hugging Face Space metadata (must stay at the top of this file for the live Space). Everything from the next heading onward is the normal GitHub readme.
An OpenEnv RL environment for training and benchmarking AI agents on Python code review.
Modern code review tools rarely expose structured, explainable training signals for automated agents. Teams mix linters, ad-hoc PR comments, and opaque model scores—making it hard to train, compare, and audit agents that assist with real review work.
This environment simulates human-like Python code review as an interactive loop: an agent reads curated snippets, acts (name the bug, submit a fix, or emit a structured audit), and receives a dense numeric reward after each step()—so you can benchmark policies, run curriculum learning, and measure partial progress instead of a single pass/fail bit.
- Automated code review assistants — reward-shaped feedback for models that suggest fixes and findings
- CI/CD and merge gates — compare agents on the same tasks with a stable, programmatic rubric
- Security-focused workflows — progression from bug naming → correct code → multi-category audit JSON
- Developer productivity tooling — reproducible head-to-head evaluation of review agents
- Education and hiring — consistent tasks, graders, and difficulty tiers (easy → medium → hard)
- Multi-dimensional grading (hard task) — separate signals for bugs, security issues, and style, then a fused score
- Real-world-style scenarios — hand-authored snippets with tests and gold reviews—not toy game MDPs
- Deterministic grading — same response on the same snippet yields the same reward (stochasticity only from which snippet is drawn unless you fix
seedonreset) - Designed for agent training — dense rewards every step, partial credit on medium/hard, clear episode boundaries
- OpenEnv + HF Space ready — typed observations/actions,
openenv.yaml, Docker, and a baselineinference.pythat passed Phase-2 validation
| Layer | Role |
|---|---|
| OpenEnv core | CodeReviewEnvironment — reset / step / state in server/environment.py |
| HTTP API | FastAPI app — /reset, /step, /health, /tasks in server/app.py |
| Task graders | tasks/task_easy.py, task_medium.py, task_hard.py |
| Reward safety | score_clamp.py keeps every returned reward strictly inside (0, 1) for evaluator compatibility |
| Baseline agent | inference.py — OpenAI-compatible client, Space URL via PING_URL, strict [START]/[STEP]/[END] stdout |
| Benchmark results | Baseline agent scores |
| Repository documentation | Complete guide |
| GitHub source | Default branch |
The environment is a Markov decision process: agents interact as follows.
state = env.reset(task_name="bug_identification")
↓ state contains: code_snippet, task_name, instructions, feedback, reward, done
for step in range(max_steps):
action = agent.act(state) # agent's response: bug type / code / JSON
state = env.step(action) # environment returns: observation + reward + done
if state.done:
break
At each step, the agent receives:
- Observation: code snippet and task instructions
- Reward (strictly inside
(0, 1)in API responses): immediate signal on response quality - Done: whether the episode ended (task solved or maximum steps reached)
The reward function is task-specific and always normalized so deployed APIs and baseline logs stay strictly inside the open unit interval (see score_clamp.py).
What each task emphasizes:
| Signal | Easy | Medium | Hard |
|---|---|---|---|
| Bug / logic detection | Primary (name the bug type) | Implicit in passing tests | ~40% weight on structured bug findings |
| Correctness / repair | — | Primary (fraction of tests passed) | — |
| Security issues | — | — | ~35% via security_issues list vs gold |
| Style / quality | — | — | ~15% via style_violations list vs gold |
| Response completeness | Aliases + synonym map + partial overlap | Runnable code + test coverage | JSON validity, keys, list shape, ordering |
Hard-task fusion: category scores are weighted, an ordering bonus is applied, then penalties cap how much missing critical items or likely hallucinations can reduce the total—before the final clamp.
Automated checks parse exact stdout lines. The baseline prints (illustrative):
[START] task=bug_identification env=code-review-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action="off-by-one error" reward=0.9900 done=true error=null
[END] success=true steps=1 score=0.9900 rewards=0.9900
The script ends with a single JSON object on stdout, e.g. {"task_1":0.85,"task_2":0.62,"task_3":0.41}, with each value strictly between 0 and 1.
class CodeReviewObservation(Observation):
code_snippet: str # Python code to review
task_name: str # "bug_identification" | "bug_fixing" | "full_review"
instructions: str # Task-specific instructions
feedback: Optional[str] # Grader feedback from previous step (null on first)
reward: float # Reward from previous step (small positive default on reset)
done: bool # Episode complete? (False on reset)class CodeReviewAction(Action):
response: str # Agent's answer (bug type string / fixed code / JSON review)| Task | Reward signal (conceptual) | After normalization |
|---|---|---|
| Bug ID (Easy) | Wrong / partial overlap / exact or alias | Clamped strictly inside (0, 1) |
| Bug Fix (Medium) | Proportional to tests passed (+ syntax/timeout paths) | Same |
| Full Review (Hard) | Weighted category scores + ordering − penalties | Same |
Dense signals at every step support stable training rather than sparse end-of-episode feedback.
Hugging Face Space: https://BugHunter28-code-review-env.hf.space
Select a task, submit a response, and inspect the returned reward in the UI or API.
Task: bug_identification (Easy)
observation = env.reset(task_name="bug_identification")Agent sees:
{
"code_snippet": "def sum_list(nums):\n total = 0\n for i in range(len(nums) + 1):\n total += nums[i]\n return total",
"task_name": "bug_identification",
"instructions": "Identify the bug and respond with ONLY the bug type as a short phrase.",
"feedback": null,
"reward": 0.01,
"done": false
}action = CodeReviewAction(response="off-by-one error")
observation = env.step(action)Agent receives (next state):
{
"code_snippet": "def sum_list(nums):\n total = 0\n for i in range(len(nums) + 1):\n total += nums[i]\n return total",
"task_name": "bug_identification",
"instructions": "Identify the bug and respond with ONLY the bug type as a short phrase.",
"feedback": "CORRECT: 'off-by-one error' matches expected bug type",
"reward": 0.99,
"done": true
}Result: The agent received a high reward (clamped below the upper bound per score_clamp.py) and the episode terminated.
| Aspect | Why it is RL | Implication |
|---|---|---|
| Markov property | Current observation fully determines the next state | Observable state is sufficient for the MDP formulation |
| Episode Structure | Fixed max steps, clear terminal conditions | Agents learn when to stop attempting retry |
| Reward Signal | Exact, partial, and scored outcomes | Dense learning signal (better than pass/fail) |
| Generalization | 32+ diverse snippets, 3 difficulty levels | Curriculum learning + benchmarking possible |
| Deterministic Grading | Same response → same score (reproducible) | Fair multi-agent comparison |
Summary: You can train an RL policy end-to-end on this environment: it is an interactive loop with explicit reset / step semantics, not a single-request API.
Three tasks with programmatic graders and increasing scope. Reported rewards are always strictly between 0 and 1 (never exactly 0.0 or 1.0 at the API boundary).
Simulates triage: given buggy backend-style Python, the agent must output only the bug type as a short phrase (e.g. "off-by-one error", "division by zero").
- Focus: Syntax-level mistakes, obvious logic flaws, and naming those patterns consistently with the dataset.
- Signal: Exact / alias / synonym match → highest band; partial keyword overlap → mid band; wrong or empty → lowest band (then clamped for the API).
Simulates patch review: the agent must return complete corrected Python for the shown function so it passes snippet-specific unit tests.
- Focus: Multiple issues can hide in one snippet—logic, structure, and API misuse are reflected in whether tests pass.
- Signal: Continuous score from fraction of tests passed; syntax errors, timeouts, and sandbox failures map to low bands (still strictly inside
(0, 1)).
Simulates structured audit: the agent must emit JSON with bugs, security_issues, and style_violations, each item with line, severity, description, ordered by severity within each list.
- Focus: Hidden logic issues, security-relevant patterns, and maintainability / style concerns—closer to how senior review comments read.
- Signal: Weighted match against gold findings per category, ordering bonus, then capped penalties (see Reward Design above).
Dataset: 32+ curated snippets—bug type, aliases, fixed code, tests, and gold JSON review per snippet.
Medium-task execution: Sandboxed exec with a whitelisted builtin surface and allowed imports (e.g. math, re, json); per-episode timeout 5s for the test thread.
Required output format:
{
"bugs": [
{"line": 3, "severity": "high", "description": "range(len(nums)+1) causes IndexError"}
],
"security_issues": [],
"style_violations": [
{"line": 3, "severity": "low", "description": "Use enumerate() or sum() instead of manual loop"}
]
}Category weights (hard task, conceptual 100% before penalties)
Bugs ≈ 40% · Security ≈ 35% · Style ≈ 15% · Severity ordering ≈ 10% — see comments in tasks/task_hard.py for the exact constants.
Illustrative decomposition (numbers before final clamp/penalties):
If category match scores normalized to ~2/3 bugs, ~1/2 security, ~0 style, plus full ordering credit, the unnormalized blend can land around 0.5+; the implementation then applies penalties and clamp so the API never returns a boundary value.
32 hand-crafted Python snippets, each covering a distinct real-world bug class:
| Bug Type | Category | Example Issue |
|---|---|---|
| Off-by-one error | Logic | range(len(nums) + 1) → IndexError |
| Division by zero | Safety | Missing zero check before / |
| Wrong initial value | Logic | total = 1 instead of 0 |
| Command injection | Security | Unsanitized shell command |
| Infinite recursion | Logic | Missing or wrong base case |
| Null pointer dereference | Safety | Access attribute on None |
| Mutating list while iterating | Logic | nums.remove() inside loop |
| Case sensitivity error | Logic | Missing .lower() for comparison |
| Resource leak | Safety | File not closed (missing context manager) |
| SQL injection | Security | String formatting in SQL query |
| ... and 22 more | Various | Real-world Python issues |
Each snippet includes:
- Buggy code (function definition)
- Bug type (canonical name and aliases)
- Fixed code (reference implementation)
- Test cases (approximately five to ten per snippet with expected outputs)
- Full review (expected bugs, security, and style findings)
Check structure:
curl http://localhost:7860/health
# Should return: {"status": "ok", "environment": "code-review-env", "tasks": [...]}Run one episode manually:
# Reset to get initial observation
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_name": "bug_identification"}'
# Submit an action (copy code_snippet from reset response, identify bug)
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response": "off-by-one error"}'
# Result: {"feedback": "...", "reward": 0.99, "done": true} # exact value clamped in (0,1)Checklist:
-
/resetreturns observation with code_snippet, task_name, instructions -
/stepaccepts action and returns reward + feedback + done - Reward is numeric, strictly inside
(0, 1), determinate for fixed snippet + response - Same action → same reward (determinism verified)
# Evaluate 3 agents (weak/baseline/strong) on all tasks
python scripts/benchmark.py --num-episodes 10 --seed 42 --json results.json
# Output: JSON + formatted table showing:
# - Weak agent: ~0.13 avg reward (random responses)
# - Baseline: ~0.46 avg reward (heuristic matching)
# - Gold: ~0.67 avg reward (oracle - proves environment grading works)
#
# Proves the environment provides meaningful signalSee EVALUATION_RESULTS.md for detailed baseline numbers and analysis.
Expected results (from version 0.1.0):
| Agent | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| Weak (random) | 0.15 | 0.05 | 0.10 | 0.10 |
| Baseline (heuristic) | 0.65 | 0.40 | 0.30 | 0.45 |
| Strong (GPT-4 caliber) | 0.88 | 0.72 | 0.65 | 0.75 |
Baseline = rule-based responses; Strong = LLM with code understanding.
python scripts/benchmark.py --num-episodes 5 --seed 42
python scripts/benchmark.py --num-episodes 5 --seed 42
# Expect identical aggregate scores when the agent policy is deterministic.Also run python validate_inference_format.py to confirm the stdout contract for automated judging.
| Metric | What It Measures | Benchmark Threshold |
|---|---|---|
| Easy task avg reward | Bug-naming accuracy | Weak: 0.1, Baseline: 0.6, Strong: 0.85+ |
| Medium task avg reward | Code-fixing capability | Weak: 0.05, Baseline: 0.4, Strong: 0.70+ |
| Hard task avg reward | Structured review quality | Weak: 0.1, Baseline: 0.3, Strong: 0.65+ |
| Overall (mean of tasks) | Agent overall code review skill | Weak: 0.08, Baseline: 0.43, Strong: 0.73+ |
To train an RL agent on this environment:
from server.environment import CodeReviewEnvironment
env = CodeReviewEnvironment()
for episode in range(num_episodes):
obs = env.reset(task_name="bug_identification") # or bug_fixing / full_review
for step in range(5): # max 5 steps per episode
action = agent.act(obs) # agent's policy
obs = env.step(action)
# Use obs.reward for training signal
agent.train(reward=obs.reward, done=obs.done)
if obs.done:
breakNo external datasets needed; grading is built-in.
Observation (what agent sees):
class CodeReviewObservation(Observation):
code_snippet: str # Raw Python code string
task_name: str # "bug_identification" | "bug_fixing" | "full_review"
instructions: str # Task-specific instruction text
feedback: Optional[str] # Grader feedback from previous step
reward: float # Reward from last step, strictly inside (0, 1)
done: bool # Episode over?Action (what agent submits):
class CodeReviewAction(Action):
response: str # Varies by task:
# - Easy: bug type phrase (str)
# - Medium: Python code (str, may contain newlines)
# - Hard: JSON string (parsed and graded)Episode Parameters:
- Max steps per episode: 5
- Terminal condition:
done=truewhen task solved OR max steps reached - Reward range: open
(0, 1)at the API (clamped away from exact 0.0 / 1.0)
.
├── data/
│ └── snippets.py # 32 hand-crafted code snippets + test cases
├── scripts/
│ ├── benchmark.py # Multi-agent evaluation (weak/baseline/strong)
│ └── eval_batch.py # Deprecated: use benchmark.py
├── server/
│ ├── app.py # FastAPI server (OpenEnv endpoints)
│ ├── environment.py # Core RL environment implementation
│ └── __init__.py
├── tasks/
│ ├── task_easy.py # Bug identification grader
│ ├── task_medium.py # Bug fixing grader
│ ├── task_hard.py # Full review grader
│ └── __init__.py
├── web/
│ ├── index.html # Web UI (optional, for manual testing)
│ ├── script.js
│ └── styles.css
├── inference.py # Example: LLM-based agent
├── client.py # Example: Python client
├── models.py # Pydantic models (shared types)
├── openenv.yaml # OpenEnv metadata
├── pyproject.toml # Python package config
├── validate_inference_format.py # Local check: stdout contract for judges
├── score_clamp.py # Shared strict (0,1) reward clamp
├── Dockerfile # Docker build for HF Spaces
├── run_tests.py # Quick test suite
└── README.md # This file
# Install dependencies
pip install -e .
# Start the server (auto-reload enabled)
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reloadThen open http://localhost:7860 in a browser or HTTP client.
docker build -t code-review-env .
docker run -p 7860:7860 code-review-envpython scripts/benchmark.py \
--num-episodes 10 \
--seed 42 \
--output results.jsonOutput: Table + JSON file with scores per task.
- Create a class implementing
Agentinterface:
class MyAgent:
def act(self, observation: dict) -> str:
# observation has: code_snippet, task_name, instructions, feedback
# return: your response (bug type / fixed code / JSON review)
return agent_response- Edit
scripts/benchmark.pyto include your agent:
agents = {
"my_custom_agent": MyAgent(),
}
python scripts/benchmark.py --agent my_custom_agent# Quick sanity check
python run_tests.py
# Verbose output
python run_tests.py --verboseVerifies:
- All three tasks reset correctly
- Rewards strictly inside
(0, 1) - Test execution sandbox behavior for the medium task
- JSON parsing for the hard task
Grading is deterministic: for a fixed snippet and a fixed agent response string, the grader returns the same reward every time—no random noise inside task_easy / task_medium / task_hard.
Episode variety: unless you fix randomness, reset() draws a snippet from the dataset (random.choice when no seed is pinned). For bit-for-bit reproducible episodes:
# Benchmark / batch tools accept --seed
python scripts/benchmark.py --seed 42
# Environment API (Python)
from server.environment import CodeReviewEnvironment
env = CodeReviewEnvironment(seed=42)
obs = env.reset(task_name="bug_identification", seed=42)Inference baseline: set PING_URL to your Space root when running inference.py in CI so the same deployment is hit every run.
For publications or submissions, state explicitly: seed values and commit hash used for tables.
| Decision | Rationale | Trade-off |
|---|---|---|
| Dense rewards | Enables faster training | Requires more grading logic |
| 3-difficulty curriculum | Agents can learn progressively | More tasks to implement |
| Sandboxed execution | Safe code evaluation | Timeout + import restrictions |
| F1-based hard task | Rewards precision AND recall | More complex scoring |
| Fixed 5-step episodes | Agents learn strategic stopping | May be too short for complex reviews |
- OpenEnv Spec: https://github.com/openenv-dev/openenv-core
- Hugging Face Spaces: https://huggingface.co/docs/hub/spaces
- RL Benchmarking: https://openrl-benchmark.com/
- Live GitHub / PR integration — stream real diffs into the same
Observationschema - Multi-language reviews — TypeScript, Go, or Rust snippets with parallel grader interfaces
- LLM-in-the-loop grading — ensemble or critic models on top of deterministic rubrics
- Collaborative agents — multi-turn review threads with tool use (linter, test runner)
- Curriculum schedules — automatic promotion easy → medium → hard by policy performance
MIT License — See LICENSE file for details.
| File | Purpose |
|---|---|
| TEST_GUIDE.md | Manual testing scenarios and answer formats |
| EVALUATION_RESULTS.md | Baseline agent numbers and analysis |
| HOW_TO_CONFIGURE_HF_SPACE.md | Space secrets (HF_TOKEN, API_BASE_URL, …) |
| README_HF_SPACE.md | Hugging Face Space README: required YAML (title, description, sdk, app_port, models) and short introduction; use as the Space root README.md when deploying |
- Issues: GitHub Issues
- Discussions: GitHub Discussions on the same repository
- Live demo: code-review-env on Hugging Face Space