An OpenEnv-compliant RL environment where an AI agent reviews buggy Python code, detects issues, suggests and applies fixes, and predicts severity — simulating a real pull-request review workflow.
```bash
# Clone and install
git clone https://github.com/your-username/AutoCodeRL
cd AutoCodeRL
pip install -r requirements.txt

# Start server
python -m uvicorn env.environment:app --host 0.0.0.0 --port 7860

# Run baseline inference
cp .env.example .env  # fill in HF_TOKEN
python inference.py
```

AutoCodeRL simulates a software code-review workflow. The agent acts as a senior reviewer examining Python code snippets for bugs across five categories, working through a structured multi-stage pipeline.
| Property | Value |
|---|---|
| Task domain | Python code review |
| Bug categories | syntax, logic, security, performance, style |
| Severity levels | low, medium, high, critical |
| Episode types | Single-bug and multi-bug |
| Grader type | Deterministic (AST, similarity, pattern matching) |
| Reward range | [0.0, 1.0] per episode |
Per the OpenEnv spec, HTTP endpoints are not accessible on HF Spaces. Connect via WebSocket:
```python
# Using the built-in WebSocket client
from client.ws_client import AutoCodeRLWSClient

with AutoCodeRLWSClient.from_hub("your-username/autocoderl", task_name="easy") as env:
    result = env.reset()       # StepResult
    result = env.step(action)  # StepResult

# Or from Docker
with AutoCodeRLWSClient.from_docker_image("autocoderl:latest") as env:
    result = env.reset()
```

Or speak to the WebSocket endpoint directly:

```python
import websockets, json, asyncio

async def run():
    async with websockets.connect("wss://your-space.hf.space/ws") as ws:
        # Reset
        await ws.send(json.dumps({"type": "reset", "task_name": "easy"}))
        resp = json.loads(await ws.recv())
        obs = resp["result"]["observation"]

        # Step
        action = {"action_type": "detect_bug", "bug_category": "syntax"}
        await ws.send(json.dumps({"type": "step", "task_name": "easy", "action": action}))
        resp = json.loads(await ws.recv())
        print(resp["result"]["reward"])

asyncio.run(run())
```

For local development, the environment can also be used in-process:

```python
from env.environment import AutoCodeRLEnv
from models.action import Action

with AutoCodeRLEnv(task_name="medium", seed=42) as env:
    result = env.reset()  # StepResult
    print(result.observation.code_snippet)
    action = Action(action_type="detect_bug", bug_category="logic")
    result = env.step(action)  # StepResult
    print(result.reward, result.done)
```

All API methods return a `StepResult` (official OpenEnv interface):
```python
class StepResult:
    observation: Observation  # current environment state
    reward: float | None      # cumulative reward so far (None after reset)
    done: bool                # episode complete
    info: dict                # debug info
```

| action_type | Required fields | Task |
|---|---|---|
| detect_bug | bug_category | easy, medium, hard |
| suggest_fix | suggested_fix | medium, hard |
| apply_fix | suggested_fix | hard |
| predict_severity | severity_prediction | hard |
| approve_code | — | any (penalised if code has bugs) |
| request_changes | — | any |
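The required-field rules in the table above can be sanity-checked client-side with a few lines of plain Python. This is only a sketch of the mapping as documented; the server's own Pydantic validation remains authoritative:

```python
# Required extra field for each action_type, per the table above
# (None means the action needs no extra field).
REQUIRED_FIELD = {
    "detect_bug": "bug_category",
    "suggest_fix": "suggested_fix",
    "apply_fix": "suggested_fix",
    "predict_severity": "severity_prediction",
    "approve_code": None,
    "request_changes": None,
}

def validate_action(action: dict) -> bool:
    """Return True if the action dict carries the field its type requires."""
    atype = action.get("action_type")
    if atype not in REQUIRED_FIELD:
        return False
    required = REQUIRED_FIELD[atype]
    return required is None or required in action

print(validate_action({"action_type": "detect_bug", "bug_category": "logic"}))  # True
print(validate_action({"action_type": "suggest_fix"}))                          # False
```

A check like this is handy before sending raw JSON over the WebSocket, where a malformed action would otherwise only fail server-side.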
Each `Observation` exposes the fields `task_id`, `task_difficulty`, `code_snippet`, `file_name`, `bug_category`, `bug_hint`, `test_status`, `step_number`, `max_steps`, `previous_action`, `previous_action_feedback`, `severity`, `multi_bug`, `bugs_found`, and `bugs_total`.
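When driving the raw WebSocket API, these fields arrive as a plain dict and typically need to be rendered into a model prompt. A minimal sketch, using only field names from the list above (the prompt wording itself is illustrative, not part of the spec):

```python
def to_prompt(obs: dict) -> str:
    """Render an observation dict into a review prompt.

    Field names follow the Observation model; the layout is
    just one possible choice, not the repo's official prompt.
    """
    hint = obs.get("bug_hint") or "none"
    return (
        f"File: {obs['file_name']} (difficulty: {obs['task_difficulty']})\n"
        f"Step {obs['step_number']}/{obs['max_steps']} | hint: {hint}\n\n"
        f"Code under review:\n{obs['code_snippet']}"
    )
```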
- **easy**: Detect the category of a single syntax bug.
  Reward: +1.0 correct, +0.2 wrong category, -0.3 approve on buggy code.
- **medium**: Detect the bug, then suggest a correct fix.
  Reward: +0.3 detect + up to +0.7 fix = 1.0 perfect.
- **hard**: Multi-bug tasks: detect → suggest → apply → predict severity.
  Reward: +0.30 detect + 0.25 suggest + 0.30 apply + 0.15 severity = 1.0.
| Event | Reward |
|---|---|
| Correct bug detection | +0.30 (per bug for multi-bug tasks) |
| Correct fix suggestion | +0.25 (similarity-weighted) |
| Correct fix application | +0.30 (AST check or similarity) |
| Correct severity prediction | +0.15 (exact match) |
| Off-by-one severity | +0.08 (partial credit) |
| Approve buggy code | -0.40 (critical penalty) |
| Repeated identical action | -0.10 (per repeat) |
| Wasted step (score=0) | -0.02 (encourages efficiency) |
All rewards clip cumulatively to [0.0, 1.0].
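The clipping rule can be made concrete with a tiny sketch. The `accumulate` helper below is hypothetical (the real logic lives in `env/reward.py`), but the shaping weights are the ones documented above:

```python
def accumulate(cumulative: float, delta: float) -> float:
    """Apply one shaping delta and clip the running total to [0.0, 1.0]."""
    return max(0.0, min(1.0, cumulative + delta))

# A perfect hard episode: detect + suggest + apply + severity.
total = 0.0
for delta in (0.30, 0.25, 0.30, 0.15):
    total = accumulate(total, delta)
print(total)  # 1.0

# The floor also holds: an early approve-on-buggy-code penalty
# cannot push the episode reward below zero.
print(accumulate(0.30, -0.40))  # 0.0
```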
Reproduced by running `python inference.py`:
| Task | Score | Model |
|---|---|---|
| easy | ~0.80 | Qwen2.5-72B-Instruct |
| medium | ~0.55 | Qwen2.5-72B-Instruct |
| hard | ~0.35 | Qwen2.5-72B-Instruct |
| Method | Path | Description |
|---|---|---|
| WS | /ws | Primary: persistent session, one per connection |
| GET | /health | Returns {"status": "healthy"} |
| GET | /web | Interactive browser-based tester |
| GET | /docs | OpenAPI documentation |
| POST | /reset/{task} | Stateless HTTP reset (local dev only) |
| POST | /step/{task} | Stateless HTTP step (local dev only) |
| GET | /state/{task} | Stateless HTTP state (local dev only) |
| GET | /tasks | List available tasks |
```bash
docker build -t autocoderl .
docker run -p 7860:7860 \
  -e HF_TOKEN=hf_your_token \
  -e API_BASE_URL=https://router.huggingface.co/v1 \
  -e MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
  autocoderl
```

Run the test suite:

```bash
pytest tests/ -v
```

Run the pre-submission validator:

```bash
chmod +x scripts/validate.sh
./scripts/validate.sh

# With live HF Space:
./scripts/validate.sh https://your-username-autocoderl.hf.space
```

Repository layout:

```text
AutoCodeRL/
├── env/
│   ├── environment.py    # FastAPI app + WebSocket /ws + AutoCodeRLEnv
│   ├── reward.py         # Shaped reward engine
│   ├── grader.py         # Deterministic graders (syntax/test/AST)
│   ├── simulator.py      # Episode state machine
│   └── validator.py      # Pre-submission checks
├── models/
│   ├── observation.py    # Pydantic Observation
│   ├── action.py         # Pydantic Action
│   ├── reward.py         # Pydantic Reward
│   └── step_result.py    # StepResult (official OpenEnv interface)
├── tasks/
│   ├── base_task.py      # Abstract base class
│   ├── easy_task.py      # Detect-only
│   ├── medium_task.py    # Detect + suggest fix
│   └── hard_task.py      # Full 4-stage pipeline
├── client/
│   └── ws_client.py      # WebSocket client (from_hub / from_docker_image)
├── data/
│   ├── bug_dataset.json  # 9 tasks (3 per difficulty)
│   └── bug_generator.py  # Dynamic task generation
├── tests/
│   ├── conftest.py
│   ├── test_env.py       # API + WebSocket + health tests
│   ├── test_graders.py   # Grader determinism tests
│   └── test_tasks.py     # Task grading tests
├── scripts/
│   ├── validate.sh       # Pre-submission validator
│   └── run_baseline.sh
├── .github/workflows/ci.yml
├── inference.py          # [START][STEP][END] baseline script
├── openenv.yaml          # OpenEnv metadata
├── Dockerfile
├── requirements.txt
├── .env.example
└── README.md
```
| Variable | Required | Default | Description |
|---|---|---|---|
| HF_TOKEN | Yes | — | HuggingFace API key |
| API_BASE_URL | No | https://router.huggingface.co/v1 | LLM endpoint |
| MODEL_NAME | No | Qwen/Qwen2.5-72B-Instruct | Model identifier |
| PORT | No | 7860 | Server port |
| WORKERS | No | 1 | Uvicorn workers |
| MAX_CONCURRENT_ENVS | No | 100 | Max WebSocket sessions |
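The variables above can be read with plain `os.environ` lookups. A sketch mirroring the documented defaults (the repo may centralise configuration differently, e.g. via Pydantic settings):

```python
import os

def load_config() -> dict:
    """Read server configuration from the environment,
    applying the defaults from the table above.
    HF_TOKEN has no default and must be set."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    return {
        "hf_token": token,
        "api_base_url": os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "port": int(os.environ.get("PORT", "7860")),
        "workers": int(os.environ.get("WORKERS", "1")),
        "max_concurrent_envs": int(os.environ.get("MAX_CONCURRENT_ENVS", "100")),
    }
```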
Per the official OpenEnv scaling docs:
- Free tier (HF Spaces): ~128 concurrent WebSocket sessions
- Local Docker: Up to 2,048 concurrent sessions with 8 workers
- Multi-node: Horizontal scaling via load balancer
```bash
# Scale locally
docker run -p 7860:7860 -e WORKERS=4 -e MAX_CONCURRENT_ENVS=400 autocoderl
```

- HF Space deploys and responds to `/health` with `{"status": "healthy"}`
- WebSocket `/ws` endpoint for judge evaluation
- OpenEnv spec: typed Pydantic models, `reset()` / `step()` / `state()` returning `StepResult`
- Dockerfile builds and starts cleanly
- `inference.py` emits `[START]` / `[STEP]` / `[END]`
- 3 tasks with deterministic graders returning scores in `[0.0, 1.0]`
- `from_hub()` and `from_docker_image()` class methods
- Interactive `/web` UI for browser testing
- 40+ unit tests across env, graders, tasks
MIT