
AutoCodeRL — AI Autonomous Code Review and Bug Fix Environment

An OpenEnv-compliant RL environment where an AI agent reviews buggy Python code, detects issues, suggests and applies fixes, and predicts severity — simulating a real pull-request review workflow.


Quick Start

# Clone and install
git clone https://github.com/your-username/AutoCodeRL
cd AutoCodeRL
pip install -r requirements.txt

# Start server
python -m uvicorn env.environment:app --host 0.0.0.0 --port 7860

# Run baseline inference
cp .env.example .env  # fill in HF_TOKEN
python inference.py

Environment Description

AutoCodeRL simulates a software code review workflow. The agent acts as a senior reviewer examining Python code snippets for bugs across 5 categories, working through a structured multi-stage pipeline.

| Property | Value |
| --- | --- |
| Task domain | Python code review |
| Bug categories | syntax, logic, security, performance, style |
| Severity levels | low, medium, high, critical |
| Episode types | Single-bug and multi-bug |
| Grader type | Deterministic (AST, similarity, pattern matching) |
| Reward range | [0.0, 1.0] per episode |

Connecting to the Environment

Primary: WebSocket /ws (required for HF Spaces)

Per the OpenEnv spec, HTTP endpoints are not accessible on HF Spaces. Connect via WebSocket:

# Using the built-in WebSocket client
from client.ws_client import AutoCodeRLWSClient

with AutoCodeRLWSClient.from_hub("your-username/autocoderl", task_name="easy") as env:
    result = env.reset()       # StepResult
    result = env.step(action)  # StepResult

# Or from Docker
with AutoCodeRLWSClient.from_docker_image("autocoderl:latest") as env:
    result = env.reset()

WebSocket protocol (raw)

import websockets, json, asyncio

async def run():
    async with websockets.connect("wss://your-space.hf.space/ws") as ws:
        # Reset
        await ws.send(json.dumps({"type": "reset", "task_name": "easy"}))
        resp = json.loads(await ws.recv())
        obs = resp["result"]["observation"]

        # Step
        action = {"action_type": "detect_bug", "bug_category": "syntax"}
        await ws.send(json.dumps({"type": "step", "task_name": "easy", "action": action}))
        resp = json.loads(await ws.recv())
        print(resp["result"]["reward"])

asyncio.run(run())

Local programmatic use

from env.environment import AutoCodeRLEnv
from models.action import Action

with AutoCodeRLEnv(task_name="medium", seed=42) as env:
    result = env.reset()          # StepResult
    print(result.observation.code_snippet)

    action = Action(action_type="detect_bug", bug_category="logic")
    result = env.step(action)     # StepResult
    print(result.reward, result.done)

StepResult

All API methods return a StepResult (official OpenEnv interface):

class StepResult:
    observation: Observation   # current environment state
    reward:      float | None  # cumulative reward so far (None after reset)
    done:        bool          # episode complete
    info:        dict          # debug info

Action Space

| `action_type` | Required fields | Task |
| --- | --- | --- |
| `detect_bug` | `bug_category` | easy, medium, hard |
| `suggest_fix` | `suggested_fix` | medium, hard |
| `apply_fix` | `suggested_fix` | hard |
| `predict_severity` | `severity_prediction` | hard |
| `approve_code` | none | any (penalised if code has bugs) |
| `request_changes` | none | any |
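The payloads in the table are plain dicts over the wire; a minimal sketch of building one per `action_type` (the fix text and category values are illustrative, not taken from the dataset):

```python
# One example payload per action_type; field names match the table above.
detect = {"action_type": "detect_bug", "bug_category": "logic"}
suggest = {"action_type": "suggest_fix", "suggested_fix": "return a + b"}
apply_fix = {"action_type": "apply_fix", "suggested_fix": "return a + b"}
severity = {"action_type": "predict_severity", "severity_prediction": "high"}
approve = {"action_type": "approve_code"}          # penalised if the code still has bugs
changes = {"action_type": "request_changes"}

# Every action carries at least its action_type.
for a in (detect, suggest, apply_fix, severity, approve, changes):
    assert "action_type" in a
```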

Observation Space

task_id, task_difficulty, code_snippet, file_name,
bug_category, bug_hint, test_status, step_number, max_steps,
previous_action, previous_action_feedback,
severity, multi_bug, bugs_found, bugs_total

Tasks

Easy — Bug Detection (max 3 steps)

Detect the category of a single syntax bug. Reward: +1.0 correct, +0.2 wrong category, -0.3 approve on buggy code.

Medium — Detection + Fix (max 5 steps)

Detect the bug then suggest a correct fix. Reward: +0.3 detect + up to +0.7 fix = 1.0 perfect.

Hard — Full Pipeline (max 8 steps)

Multi-bug tasks: detect → suggest → apply → predict severity. Reward: 0.30 detect + 0.25 suggest + 0.30 apply + 0.15 severity = 1.00 perfect.
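The four-stage hard pipeline can be scripted as an ordered sequence of actions; a sketch (the category, fix text, and severity here are illustrative placeholders):

```python
# Scripted policy for the hard task: detect -> suggest -> apply -> predict severity.
# Action dicts mirror the documented action space.
def hard_task_actions(bug_category, fix, severity):
    yield {"action_type": "detect_bug", "bug_category": bug_category}
    yield {"action_type": "suggest_fix", "suggested_fix": fix}
    yield {"action_type": "apply_fix", "suggested_fix": fix}
    yield {"action_type": "predict_severity", "severity_prediction": severity}

plan = list(hard_task_actions("security", "use parameterized SQL queries", "critical"))
print([a["action_type"] for a in plan])
# ['detect_bug', 'suggest_fix', 'apply_fix', 'predict_severity']
```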


Reward Function

Correct bug detection         +0.30  (per bug for multi-bug tasks)
Correct fix suggestion        +0.25  (similarity-weighted)
Correct fix application       +0.30  (AST check or similarity)
Correct severity prediction   +0.15  (exact match)
Off-by-one severity           +0.08  (partial credit)
Approve buggy code            -0.40  (critical penalty)
Repeated identical action     -0.10  (per repeat)
Wasted step (score=0)         -0.02  (encourages efficiency)

All rewards clip cumulatively to [0.0, 1.0].
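One plausible reading of cumulative clipping, sketched with the deltas from the table above (the clip-at-each-step behaviour is an assumption about the implementation):

```python
# Cumulative reward, clipped to [0.0, 1.0] after each step.
def accumulate(deltas):
    total = 0.0
    for d in deltas:
        total = min(1.0, max(0.0, total + d))
    return total

# Perfect hard episode: detect + suggest + apply + severity.
assert accumulate([0.30, 0.25, 0.30, 0.15]) == 1.0
# Approving buggy code on step one cannot push the total below 0.0.
assert accumulate([-0.40]) == 0.0
```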


Baseline Scores

Reproduce these scores by running `python inference.py`:

| Task | Score | Model |
| --- | --- | --- |
| easy | ~0.80 | Qwen2.5-72B-Instruct |
| medium | ~0.55 | Qwen2.5-72B-Instruct |
| hard | ~0.35 | Qwen2.5-72B-Instruct |

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| WS | `/ws` | Primary: persistent session, one per connection |
| GET | `/health` | Returns `{"status": "healthy"}` |
| GET | `/web` | Interactive browser-based tester |
| GET | `/docs` | OpenAPI documentation |
| POST | `/reset/{task}` | Stateless HTTP reset (local dev only) |
| POST | `/step/{task}` | Stateless HTTP step (local dev only) |
| GET | `/state/{task}` | Stateless HTTP state (local dev only) |
| GET | `/tasks` | List available tasks |

Setup

Docker

docker build -t autocoderl .
docker run -p 7860:7860 \
  -e HF_TOKEN=hf_your_token \
  -e API_BASE_URL=https://router.huggingface.co/v1 \
  -e MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
  autocoderl

Tests

pytest tests/ -v

Pre-submission validation

chmod +x scripts/validate.sh
./scripts/validate.sh
# With live HF Space:
./scripts/validate.sh https://your-username-autocoderl.hf.space

Project Structure

AutoCodeRL/
├── env/
│   ├── environment.py   # FastAPI app + WebSocket /ws + AutoCodeRLEnv
│   ├── reward.py        # Shaped reward engine
│   ├── grader.py        # Deterministic graders (syntax/test/AST)
│   ├── simulator.py     # Episode state machine
│   └── validator.py     # Pre-submission checks
├── models/
│   ├── observation.py   # Pydantic Observation
│   ├── action.py        # Pydantic Action
│   ├── reward.py        # Pydantic Reward
│   └── step_result.py   # StepResult (official OpenEnv interface)
├── tasks/
│   ├── base_task.py     # Abstract base class
│   ├── easy_task.py     # Detect-only
│   ├── medium_task.py   # Detect + suggest fix
│   └── hard_task.py     # Full 4-stage pipeline
├── client/
│   └── ws_client.py     # WebSocket client (from_hub / from_docker_image)
├── data/
│   ├── bug_dataset.json # 9 tasks (3 per difficulty)
│   └── bug_generator.py # Dynamic task generation
├── tests/
│   ├── conftest.py
│   ├── test_env.py      # API + WebSocket + health tests
│   ├── test_graders.py  # Grader determinism tests
│   └── test_tasks.py    # Task grading tests
├── scripts/
│   ├── validate.sh      # Pre-submission validator
│   └── run_baseline.sh
├── .github/workflows/ci.yml
├── inference.py         # [START][STEP][END] baseline script
├── openenv.yaml         # OpenEnv metadata
├── Dockerfile
├── requirements.txt
├── .env.example
└── README.md

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `HF_TOKEN` | Yes | — | HuggingFace API key |
| `API_BASE_URL` | No | `https://router.huggingface.co/v1` | LLM endpoint |
| `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
| `PORT` | No | `7860` | Server port |
| `WORKERS` | No | `1` | Uvicorn workers |
| `MAX_CONCURRENT_ENVS` | No | `100` | Max WebSocket sessions |
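The variables above map onto the repo's `.env.example`; a filled-in sketch using the defaults from the table (the token value is a placeholder):

```shell
# .env — defaults taken from the table above; HF_TOKEN is a placeholder.
HF_TOKEN=hf_your_token
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
PORT=7860
WORKERS=1
MAX_CONCURRENT_ENVS=100
```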

Scaling

Per the official OpenEnv scaling docs:

  • Free tier (HF Spaces): ~128 concurrent WebSocket sessions
  • Local Docker: Up to 2,048 concurrent sessions with 8 workers
  • Multi-node: Horizontal scaling via load balancer

# Scale locally
docker run -p 7860:7860 -e WORKERS=4 -e MAX_CONCURRENT_ENVS=400 autocoderl

Submission Checklist

  • HF Space deploys and responds to /health with {"status": "healthy"}
  • WebSocket /ws endpoint for judge evaluation
  • OpenEnv spec: typed Pydantic models, reset() / step() / state() returning StepResult
  • Dockerfile builds and starts cleanly
  • inference.py emits [START] / [STEP] / [END]
  • 3 tasks with deterministic graders returning scores in [0.0, 1.0]
  • from_hub() and from_docker_image() class methods
  • Interactive /web UI for browser testing
  • 40+ unit tests across env, graders, tasks

License

MIT
