VibeBench is a research-focused benchmarking framework designed to evaluate and compare AI-based coding assistants across standardized programming tasks. It provides a comprehensive dashboard to generate code, execute it in a controlled environment, and analyze it for correctness, security, and readability.
- Ollama Cloud Integration: Powered by the latest high-parameter coding models (e.g., Qwen 3 Coder 480B) via the Ollama Cloud API.
- Split-Pane Interface: A professional dashboard featuring a top configuration bar, side-by-side code/metrics view, and a comparison panel for Execution vs. Expected output.
- Industry-Standard Metrics: Normalized 0–10 scores based on real static-analysis tools:
  - Security: Bandit (Python), Semgrep (JS/PHP), and ShellCheck (Bash).
  - Readability: Radon (Python) and Lizard (multi-language) for cyclomatic complexity.
  - Correctness: Functional verification against task-specific ground truths.
- Task Management: 8 specialized tasks (A-H) covering file I/O, multi-threading, databases, and authentication.
- Node.js 18+ & npm
- Python 3.10+ & pip
- Ollama Cloud API Key: Required for AI code generation.
- Static Analysis Tools:
```shell
# Python tools
pip install bandit semgrep radon lizard

# Shell tools (on macOS)
brew install shellcheck
```
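Before running the backend, it can help to confirm every analysis tool is actually on your `PATH`. This is a small convenience sketch, not part of the project itself:

```shell
# Report which of the required static-analysis tools are installed
for tool in bandit semgrep radon lizard shellcheck; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: OK"
  else
    echo "$tool: MISSING"
  fi
done
```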
```shell
cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env and add your OLLAMA_API_KEY
```

Start the API:

```shell
uvicorn api:app --reload --host 0.0.0.0 --port 8000
```

Set up the frontend:

```shell
cd frontend
npm install
npm run dev
```

Open http://localhost:3000 to start benchmarking.
VibeBench uses a normalized 0–10 scale to provide a clear comparison between models:
| Metric | Max Score | Logic |
|---|---|---|
| Correctness | 10 | 10 for success, 5 for partial completion, 0 for failure. |
| Security | 10 | Starts at 10; penalties for High (-5), Medium (-2), or Low (-1) issues. |
| Readability | 10 | Combines Cyclomatic Complexity (Lizard/Radon) and Comment Density. |
| Quality Gate | Pass/Fail | A final check: Requires Status: Success AND no major static quality violations. |
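The security scoring rule from the table can be sketched in a few lines of Python. This is an illustrative helper (the function name and clamping behavior are assumptions, not the project's actual implementation):

```python
def security_score(high: int, medium: int, low: int) -> int:
    """Normalize static-analysis findings to a 0-10 security score.

    Starts at 10 and applies the table's penalties: High -5,
    Medium -2, Low -1, clamping the result at 0.
    """
    score = 10 - 5 * high - 2 * medium - 1 * low
    return max(0, score)

# Example: one High and two Low findings -> 10 - 5 - 2 = 3
```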
| Task | Category | Objective |
|---|---|---|
| A | File I/O | Parse complex CSV data and calculate aggregates. |
| B | Concurrency | Multi-threaded JSON processing with data integrity tokens. |
| C | File Writing | Generate a deterministic report file from JSON source. |
| D | Data Processing | Multi-threaded JSON transformation and checksum calculation. |
| E | Archiving | Secure ZIP creation with manifest validation. |
| F | Databases | MySQL relational query service with filtering/sorting. |
| G | Databases | MongoDB document-query feature with projections. |
| H | Security | Modern authentication utility using PBKDF2-HMAC-SHA256. |
- `frontend/`: Next.js UI built with Tailwind CSS.
- `backend/`: FastAPI orchestrator and execution sandbox.
- `backend/scanners/`: Integration logic for Bandit, Semgrep, and Lizard.
- `backend/tasks/evaluators.py`: The "source of truth" for task-verification logic.
- `test_data/`: Canonical fixtures used as inputs and expected outputs.
- `docs/`: Detailed research requirements and architecture documentation.
- Sandbox Security: Code is executed using local runtimes. Ensure your environment has the necessary interpreters (Python, Node, PHP, Bash) installed.
- Research Focus: This project is designed for academic comparison; all prompts and outputs are logged in `experiments.db` (SQLite) for meta-analysis.
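Appending a run to a SQLite log like `experiments.db` could look like the sketch below. The `runs` table name and its columns are hypothetical; the real schema lives in the backend:

```python
import sqlite3

def log_run(db_path: str, model: str, task: str,
            prompt: str, output: str) -> None:
    """Append one benchmark run to a SQLite log (hypothetical schema)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "model TEXT, task TEXT, prompt TEXT, output TEXT, "
            "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        conn.execute(
            "INSERT INTO runs (model, task, prompt, output) VALUES (?, ?, ?, ?)",
            (model, task, prompt, output),
        )
```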