---
title: Rust Coder OpenEnv
emoji: 🦀
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
---
Rust Coder is a high-fidelity OpenEnv environment designed to evaluate and train LLM agents on real-world Rust systems programming tasks. Unlike toy environments, Rust Coder simulates realistic engineering scenarios involving the borrow checker, concurrency, and memory safety.
Rust is uniquely challenging for AI agents due to its strict compile-time safety guarantees. This environment provides a 10-task progression that measures an agent's ability to:
- Fix borrow checker violations
- Correctly annotate lifetimes
- Resolve concurrency deadlocks
- Write unsafe FFI code correctly
- Identify and prevent memory leaks
- Optimize data pipelines for performance
Type: `RustCoderAction`

The agent submits a single string containing the complete, fixed Rust source code.

| Field | Type | Description |
|---|---|---|
| `code` | string | Full Rust source code to compile and test |
Type: `RustCoderObservation`

The environment returns detailed feedback after each submission:

| Field | Type | Description |
|---|---|---|
| `problem_description` | string | Task requirements and context |
| `header_section` | string | LeetCode-style scaffold (imports + signatures/types) |
| `compilation_success` | bool | Whether rustc compiled the submitted code |
| `compilation_output` | string | Raw compiler errors and warnings |
| `test_results` | list[dict] | Per-test pass/fail results with error details |
| `reward_breakdown` | dict | Weighted score breakdown across 5 dimensions |
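Put together, a step submission and its observation can be handled like this — a minimal sketch using the documented JSON schema, where the sample observation values are hypothetical and only illustrate the field shapes from the tables above:

```python
import json

# Build a step action: the agent submits one string of Rust source.
action = {"action": {"code": 'fn main() { println!("hello"); }'}}
payload = json.dumps(action)  # request body for POST /step

# A hypothetical observation, shaped like the fields documented above:
obs = {
    "compilation_success": True,
    "compilation_output": "",
    "test_results": [{"name": "test_hello", "passed": True}],
    "reward_breakdown": {"compilation": 0.4},
}

if obs["compilation_success"]:
    passed = sum(1 for t in obs["test_results"] if t["passed"])
    print(f"{passed}/{len(obs['test_results'])} tests passed")
```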
Total reward is a weighted sum of 5 dimensions, each normalized to [0, 1]:
| Dimension | Weight | Metric |
|---|---|---|
| Compilation | 40% | Binary success/failure of rustc |
| Correctness | 20% | Fraction of test assertions that pass |
| Coverage | 20% | Fraction of tests that successfully ran |
| Elegance | 10% | Code quality heuristics (penalizes `.unwrap()`, long lines, `unsafe`) |
| Efficiency | 10% | Execution time vs. per-problem baseline |
Reward provides partial signal at every step — compilation alone earns 0.40, passing all tests earns up to 1.0.
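The weighting above can be sketched as a small helper. This is illustrative only, using the weights from the table, not the environment's actual scoring code:

```python
# Weights from the reward table above (each dimension scored in [0, 1]).
WEIGHTS = {
    "compilation": 0.40,
    "correctness": 0.20,
    "coverage": 0.20,
    "elegance": 0.10,
    "efficiency": 0.10,
}

def total_reward(scores: dict) -> float:
    """Total reward is the weighted sum of the per-dimension scores."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

# A submission that compiles but passes nothing still earns 0.40:
print(total_reward({"compilation": 1.0}))  # → 0.4
```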
10 sequential problems with increasing difficulty:
| ID | Title | Difficulty | Skill Evaluated |
|---|---|---|---|
| 1 | Broken CLI Argument Parser | Easy | Enums & pattern matching |
| 2 | Conflicting Borrows | Easy→Med | Borrow checker |
| 3 | Invalid Lifetime Annotations | Medium | Lifetime annotations |
| 4 | Business Logic Errors | Medium | Math & correctness |
| 5 | Linked List Management | Medium | Ownership & data structures |
| 6 | Multi-threaded Deadlocks | Hard | Mutex & concurrency |
| 7 | Async Borrowing Conflicts | Hard | Async/await lifetimes |
| 8 | Unsafe FFI Integration | Hard | unsafe & C interop |
| 9 | Inefficient Data Pipeline | Hard | Performance optimization |
| 10 | Memory Leak Prevention | Hard+ | Weak pointers & ownership |
The environment reads the following variables. Set them as HF Space secrets (Settings → Variables and Secrets) when deploying to Hugging Face, or in a local .env file for development.
| Variable | Required | Default | Description |
|---|---|---|---|
| `HF_TOKEN` | Yes | — | Hugging Face API token for LLM calls |
| `API_BASE_URL` | No | `https://router.huggingface.co/v1` | Inference endpoint |
| `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model to use for evaluation |

Note: The `.env` file is excluded from Docker images by `.dockerignore`. On HF Spaces, secrets are injected as OS environment variables by the platform — `load_dotenv()` silently does nothing if no file is present, and `os.getenv()` reads the platform-injected vars. This is the correct behavior.
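That loading pattern looks roughly like the following sketch (the variable names and defaults are the ones documented above; the exact code in this repo may differ):

```python
import os

# Optional: pull variables from a local .env during development.
# On HF Spaces no .env file exists, so load_dotenv() is a silent no-op
# and the platform-injected environment variables are used instead.
try:
    from dotenv import load_dotenv  # python-dotenv, optional dependency
    load_dotenv()
except ImportError:
    pass

HF_TOKEN = os.getenv("HF_TOKEN")  # required, no default
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")

if HF_TOKEN is None:
    print("warning: HF_TOKEN is not set; LLM calls will fail")
```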
```bash
# 1. Clone and enter the repo
git clone https://github.com/your-username/rust_coder
cd rust_coder

# 2. Create .env with your credentials
cat > .env << EOF
HF_TOKEN=hf_your_token_here
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
EOF

# 3. Build the Docker image (uses root Dockerfile)
docker build -t rust_coder:latest .

# 4. Run the environment server
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# 5. Verify it's healthy
curl http://localhost:8000/health
# → {"status": "healthy"}

# 6. Run the inference benchmark
python inference.py
```

```bash
# Build
docker build -t rust_coder:latest .

# Run with .env file
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# View logs
docker logs rust_env

# Stop
docker stop rust_env
```

```bash
# Reset (returns first problem)
curl -X POST http://localhost:8000/reset

# Step (submit Rust code)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"code": "fn main() { println!(\"hello\"); }"}}'

# Health check
curl http://localhost:8000/health
```

```bash
# Install HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Push to Space
openenv push --repo-id your-username/rust-coder
```

Then go to your Space settings and add secrets:

- `HF_TOKEN` → your Hugging Face API token
- `MODEL_NAME` → e.g. `Qwen/Qwen2.5-72B-Instruct`
Baseline using Qwen/Qwen2.5-72B-Instruct via Hugging Face router:
| Metric | Score |
|---|---|
| Average reward | 0.59 |
| Compilation % | ~85% |
| Correctness % | ~45% |
```
rust_coder/
├── Dockerfile                   # Root Dockerfile (used by validator + HF Spaces)
├── server/Dockerfile            # Identical copy (used for -f flag builds)
├── openenv.yaml                 # OpenEnv spec metadata
├── pyproject.toml               # Python package config
├── uv.lock                      # Locked dependencies
├── problems.json                # 10 coding problems dataset
├── models.py                    # Pydantic action/observation types
├── client.py                    # WebSocket client for RustCoderEnv
├── inference.py                 # Baseline inference script (entry point)
├── __init__.py                  # Package exports
└── server/
    ├── app.py                   # FastAPI OpenEnv server entrypoint
    └── rust_coder_environment.py  # Core environment logic
```
- The Hugging Face Space serves the environment via `uvicorn server.app:app` (see `openenv.yaml` and `Dockerfile`).
- The built-in OpenEnv web UI may send an empty action on Step; this environment supports that by auto-calling the LLM when `action.code` is empty (unless disabled via `AUTO_LLM_ON_EMPTY_STEP=0`).
- `inference.py` is the required baseline runner used by the validator/judge. It connects to the running Space and drives `reset()`/`step()` in a loop, emitting strict `[START]`/`[STEP]`/`[END]` stdout lines.