An OpenEnv-compliant RL environment where an AI agent reviews buggy Python code, detects issues, suggests and applies fixes, and predicts severity — simulating a real pull-request review workflow.
```bash
# Clone and install
git clone https://github.com/your-username/AutoCodeRL
cd AutoCodeRL
pip install -r requirements.txt

# Start server
python -m uvicorn env.environment:app --host 0.0.0.0 --port 7860

# Run baseline inference
cp .env.example .env  # fill in HF_TOKEN
python inference.py
```

AutoCodeRL simulates a software code-review workflow. The agent acts as a senior reviewer examining Python code snippets for bugs across five categories, working through a structured multi-stage pipeline.
| Property | Value |
|---|---|
| Task domain | Python code review |
| Bug categories | syntax, logic, security, performance, style |
| Severity levels | low, medium, high, critical |
| Episode types | Single-bug and multi-bug |
| Grader type | Deterministic (AST, similarity, pattern matching) |
| Reward range | [0.0, 1.0] per episode |
Per the OpenEnv spec, HTTP endpoints are not accessible on HF Spaces. Connect via WebSocket:
```python
# Using the built-in WebSocket client
from client.ws_client import AutoCodeRLWSClient

with AutoCodeRLWSClient.from_hub("your-username/autocoderl", task_name="easy") as env:
    result = env.reset()       # StepResult
    result = env.step(action)  # StepResult

# Or from Docker
with AutoCodeRLWSClient.from_docker_image("autocoderl:latest") as env:
    result = env.reset()
```

Or speak to the WebSocket endpoint directly:

```python
import websockets, json, asyncio

async def run():
    async with websockets.connect("wss://your-space.hf.space/ws") as ws:
        # Reset
        await ws.send(json.dumps({"type": "reset", "task_name": "easy"}))
        resp = json.loads(await ws.recv())
        obs = resp["result"]["observation"]

        # Step
        action = {"action_type": "detect_bug", "bug_category": "syntax"}
        await ws.send(json.dumps({"type": "step", "task_name": "easy", "action": action}))
        resp = json.loads(await ws.recv())
        print(resp["result"]["reward"])

asyncio.run(run())
```

For local development, the environment can also be used in-process:

```python
from env.environment import AutoCodeRLEnv
from models.action import Action

with AutoCodeRLEnv(task_name="medium", seed=42) as env:
    result = env.reset()  # StepResult
    print(result.observation.code_snippet)
    action = Action(action_type="detect_bug", bug_category="logic")
    result = env.step(action)  # StepResult
    print(result.reward, result.done)
```

All API methods return a `StepResult` (official OpenEnv interface):
```python
class StepResult:
    observation: Observation  # current environment state
    reward: float | None      # cumulative reward so far (None after reset)
    done: bool                # episode complete
    info: dict                # debug info
```

| action_type | Required fields | Task |
|---|---|---|
| detect_bug | bug_category | easy, medium, hard |
| suggest_fix | suggested_fix | medium, hard |
| apply_fix | suggested_fix | hard |
| predict_severity | severity_prediction | hard |
| approve_code | — | any (penalised if code has bugs) |
| request_changes | — | any |
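The required-field rules in the table above can be sanity-checked client-side with a few lines of plain Python. This is only a sketch of the mapping as documented; the server's own Pydantic validation remains authoritative:

```python
# Required extra field for each action_type, per the table above
# (None means the action needs no extra field).
REQUIRED_FIELD = {
    "detect_bug": "bug_category",
    "suggest_fix": "suggested_fix",
    "apply_fix": "suggested_fix",
    "predict_severity": "severity_prediction",
    "approve_code": None,
    "request_changes": None,
}

def validate_action(action: dict) -> bool:
    """Return True if the action dict carries the field its type requires."""
    atype = action.get("action_type")
    if atype not in REQUIRED_FIELD:
        return False
    required = REQUIRED_FIELD[atype]
    return required is None or required in action

print(validate_action({"action_type": "detect_bug", "bug_category": "logic"}))  # True
print(validate_action({"action_type": "suggest_fix"}))                          # False
```

A check like this is handy before sending raw JSON over the WebSocket, where a malformed action would otherwise only fail server-side.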
Each `Observation` exposes the fields `task_id`, `task_difficulty`, `code_snippet`, `file_name`, `bug_category`, `bug_hint`, `test_status`, `step_number`, `max_steps`, `previous_action`, `previous_action_feedback`, `severity`, `multi_bug`, `bugs_found`, and `bugs_total`.
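When driving the raw WebSocket API, these fields arrive as a plain dict and typically need to be rendered into a model prompt. A minimal sketch, using only field names from the list above (the prompt wording itself is illustrative, not part of the spec):

```python
def to_prompt(obs: dict) -> str:
    """Render an observation dict into a review prompt.

    Field names follow the Observation model; the layout is
    just one possible choice, not the repo's official prompt.
    """
    hint = obs.get("bug_hint") or "none"
    return (
        f"File: {obs['file_name']} (difficulty: {obs['task_difficulty']})\n"
        f"Step {obs['step_number']}/{obs['max_steps']} | hint: {hint}\n\n"
        f"Code under review:\n{obs['code_snippet']}"
    )
```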
- **easy**: Detect the category of a single syntax bug.
  Reward: +1.0 correct, +0.2 wrong category, -0.3 approve on buggy code.
- **medium**: Detect the bug, then suggest a correct fix.
  Reward: +0.3 detect + up to +0.7 fix = 1.0 perfect.
- **hard**: Multi-bug tasks: detect → suggest → apply → predict severity.
  Reward: +0.30 detect + 0.25 suggest + 0.30 apply + 0.15 severity = 1.0.
| Event | Reward |
|---|---|
| Correct bug detection | +0.30 (per bug for multi-bug tasks) |
| Correct fix suggestion | +0.25 (similarity-weighted) |
| Correct fix application | +0.30 (AST check or similarity) |
| Correct severity prediction | +0.15 (exact match) |
| Off-by-one severity | +0.08 (partial credit) |
| Approve buggy code | -0.40 (critical penalty) |
| Repeated identical action | -0.10 (per repeat) |
| Wasted step (score=0) | -0.02 (encourages efficiency) |
All rewards clip cumulatively to [0.0, 1.0].
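The clipping rule can be made concrete with a tiny sketch. The `accumulate` helper below is hypothetical (the real logic lives in `env/reward.py`), but the shaping weights are the ones documented above:

```python
def accumulate(cumulative: float, delta: float) -> float:
    """Apply one shaping delta and clip the running total to [0.0, 1.0]."""
    return max(0.0, min(1.0, cumulative + delta))

# A perfect hard episode: detect + suggest + apply + severity.
total = 0.0
for delta in (0.30, 0.25, 0.30, 0.15):
    total = accumulate(total, delta)
print(total)  # 1.0

# The floor also holds: an early approve-on-buggy-code penalty
# cannot push the episode reward below zero.
print(accumulate(0.30, -0.40))  # 0.0
```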
Reproduced by running `python inference.py`:
| Task | Score | Model |
|---|---|---|
| easy | ~0.80 | Qwen2.5-72B-Instruct |
| medium | ~0.55 | Qwen2.5-72B-Instruct |
| hard | ~0.35 | Qwen2.5-72B-Instruct |
| Method | Path | Description |
|---|---|---|
| WS | /ws | Primary: persistent session, one per connection |
| GET | /health | Returns {"status": "healthy"} |
| GET | /web | Interactive browser-based tester |
| GET | /docs | OpenAPI documentation |
| POST | /reset/{task} | Stateless HTTP reset (local dev only) |
| POST | /step/{task} | Stateless HTTP step (local dev only) |
| GET | /state/{task} | Stateless HTTP state (local dev only) |
| GET | /tasks | List available tasks |
```bash
docker build -t autocoderl .
docker run -p 7860:7860 \
  -e HF_TOKEN=hf_your_token \
  -e API_BASE_URL=https://router.huggingface.co/v1 \
  -e MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
  autocoderl
```

Run the test suite:

```bash
pytest tests/ -v
```

Run the pre-submission validator:

```bash
chmod +x scripts/validate.sh
./scripts/validate.sh

# With live HF Space:
./scripts/validate.sh https://your-username-autocoderl.hf.space
```

Repository layout:

```text
AutoCodeRL/
├── env/
│   ├── environment.py    # FastAPI app + WebSocket /ws + AutoCodeRLEnv
│   ├── reward.py         # Shaped reward engine
│   ├── grader.py         # Deterministic graders (syntax/test/AST)
│   ├── simulator.py      # Episode state machine
│   └── validator.py      # Pre-submission checks
├── models/
│   ├── observation.py    # Pydantic Observation
│   ├── action.py         # Pydantic Action
│   ├── reward.py         # Pydantic Reward
│   └── step_result.py    # StepResult (official OpenEnv interface)
├── tasks/
│   ├── base_task.py      # Abstract base class
│   ├── easy_task.py      # Detect-only
│   ├── medium_task.py    # Detect + suggest fix
│   └── hard_task.py      # Full 4-stage pipeline
├── client/
│   └── ws_client.py      # WebSocket client (from_hub / from_docker_image)
├── data/
│   ├── bug_dataset.json  # 9 tasks (3 per difficulty)
│   └── bug_generator.py  # Dynamic task generation
├── tests/
│   ├── conftest.py
│   ├── test_env.py       # API + WebSocket + health tests
│   ├── test_graders.py   # Grader determinism tests
│   └── test_tasks.py     # Task grading tests
├── scripts/
│   ├── validate.sh       # Pre-submission validator
│   └── run_baseline.sh
├── .github/workflows/ci.yml
├── inference.py          # [START][STEP][END] baseline script
├── openenv.yaml          # OpenEnv metadata
├── Dockerfile
├── requirements.txt
├── .env.example
└── README.md
```
| Variable | Required | Default | Description |
|---|---|---|---|
| HF_TOKEN | Yes | — | HuggingFace API key |
| API_BASE_URL | No | https://router.huggingface.co/v1 | LLM endpoint |
| MODEL_NAME | No | Qwen/Qwen2.5-72B-Instruct | Model identifier |
| PORT | No | 7860 | Server port |
| WORKERS | No | 1 | Uvicorn workers |
| MAX_CONCURRENT_ENVS | No | 100 | Max WebSocket sessions |
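The variables above can be read with plain `os.environ` lookups. A sketch mirroring the documented defaults (the repo may centralise configuration differently, e.g. via Pydantic settings):

```python
import os

def load_config() -> dict:
    """Read server configuration from the environment,
    applying the defaults from the table above.
    HF_TOKEN has no default and must be set."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    return {
        "hf_token": token,
        "api_base_url": os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "port": int(os.environ.get("PORT", "7860")),
        "workers": int(os.environ.get("WORKERS", "1")),
        "max_concurrent_envs": int(os.environ.get("MAX_CONCURRENT_ENVS", "100")),
    }
```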
Per the official OpenEnv scaling docs:
- Free tier (HF Spaces): ~128 concurrent WebSocket sessions
- Local Docker: Up to 2,048 concurrent sessions with 8 workers
- Multi-node: Horizontal scaling via load balancer
```bash
# Scale locally
docker run -p 7860:7860 -e WORKERS=4 -e MAX_CONCURRENT_ENVS=400 autocoderl
```

- HF Space deploys and responds to `/health` with `{"status": "healthy"}`
- WebSocket `/ws` endpoint for judge evaluation
- OpenEnv spec: typed Pydantic models, `reset()` / `step()` / `state()` returning `StepResult`
- Dockerfile builds and starts cleanly
- `inference.py` emits `[START]` / `[STEP]` / `[END]`
- 3 tasks with deterministic graders returning scores in `[0.0, 1.0]`
- `from_hub()` and `from_docker_image()` class methods
- Interactive `/web` UI for browser testing
- 40+ unit tests across env, graders, tasks
MIT