VibeBench is a research-focused benchmarking framework designed to evaluate and compare AI-based coding assistants across standardized programming tasks. It provides a comprehensive dashboard to generate code, execute it in a controlled environment, and analyze it for correctness, security, and readability.
- Ollama Cloud Integration: Powered by the latest high-parameter coding models (e.g., Qwen 3 Coder 480B) via the Ollama Cloud API.
- Split-Pane Interface: A professional dashboard featuring a top configuration bar, side-by-side code/metrics view, and a comparison panel for Execution vs. Expected output.
- Industry-Standard Metrics: Normalized 0–10 scores based on real static-analysis tools:
  - Security: Bandit (Python), Semgrep (JS/PHP), and ShellCheck (Bash).
  - Readability: Radon (Python) and Lizard (multi-language) for cyclomatic complexity.
  - Correctness: Functional verification against task-specific ground truths.
- Task Management: 8 specialized tasks (A-H) covering file I/O, multi-threading, databases, and authentication.
- Node.js 18+ & npm
- Python 3.10+ & pip
- Ollama Cloud API Key: Required for AI code generation.
- Static Analysis Tools:
```shell
# Python tools
pip install bandit semgrep radon lizard

# Shell tools (on macOS)
brew install shellcheck
```
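Before running the backend, it can help to confirm every analysis tool is actually on your `PATH`. This is a small convenience sketch, not part of the project itself:

```shell
# Report which of the required static-analysis tools are installed
for tool in bandit semgrep radon lizard shellcheck; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: OK"
  else
    echo "$tool: MISSING"
  fi
done
```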
```shell
cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env and add your OLLAMA_API_KEY
```

Start the API:

```shell
uvicorn api:app --reload --host 0.0.0.0 --port 8000
```

Set up the frontend:

```shell
cd frontend
npm install
npm run dev
```

Open http://localhost:3000 to start benchmarking.
VibeBench uses a normalized 0–10 scale to provide a clear comparison between models:
| Metric | Max Score | Logic |
|---|---|---|
| Correctness | 10 | 10 for success, 5 for partial completion, 0 for failure. |
| Security | 10 | Starts at 10; penalties for High (-5), Medium (-2), or Low (-1) issues. |
| Readability | 10 | Combines Cyclomatic Complexity (Lizard/Radon) and Comment Density. |
| Quality Gate | Pass/Fail | A final check: Requires Status: Success AND no major static quality violations. |
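The security scoring rule from the table can be sketched in a few lines of Python. This is an illustrative helper (the function name and clamping behavior are assumptions, not the project's actual implementation):

```python
def security_score(high: int, medium: int, low: int) -> int:
    """Normalize static-analysis findings to a 0-10 security score.

    Starts at 10 and applies the table's penalties: High -5,
    Medium -2, Low -1, clamping the result at 0.
    """
    score = 10 - 5 * high - 2 * medium - 1 * low
    return max(0, score)

# Example: one High and two Low findings -> 10 - 5 - 2 = 3
```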
| Task | Category | Objective |
|---|---|---|
| A | File I/O | Parse complex CSV data and calculate aggregates. |
| B | Concurrency | Multi-threaded JSON processing with data integrity tokens. |
| C | File Writing | Generate a deterministic report file from JSON source. |
| D | Data Processing | Multi-threaded JSON transformation and checksum calculation. |
| E | Archiving | Secure ZIP creation with manifest validation. |
| F | Databases | MySQL relational query service with filtering/sorting. |
| G | Databases | MongoDB document-query feature with projections. |
| H | Security | Modern authentication utility using PBKDF2-HMAC-SHA256. |
- `frontend/`: Next.js UI built with Tailwind CSS.
- `backend/`: FastAPI orchestrator and execution sandbox.
- `backend/scanners/`: Integration logic for Bandit, Semgrep, and Lizard.
- `backend/tasks/evaluators.py`: The "source of truth" for task-verification logic.
- `test_data/`: Canonical fixtures used as inputs and expected outputs.
- `docs/`: Detailed research requirements and architecture documentation.
- Sandbox Security: Code is executed using local runtimes. Ensure your environment has the necessary interpreters (Python, Node, PHP, Bash) installed.
- Research Focus: This project is designed for academic comparison; all prompts and outputs are logged in `experiments.db` (SQLite) for meta-analysis.
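Appending a run to a SQLite log like `experiments.db` could look like the sketch below. The `runs` table name and its columns are hypothetical; the real schema lives in the backend:

```python
import sqlite3

def log_run(db_path: str, model: str, task: str,
            prompt: str, output: str) -> None:
    """Append one benchmark run to a SQLite log (hypothetical schema)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "model TEXT, task TEXT, prompt TEXT, output TEXT, "
            "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        conn.execute(
            "INSERT INTO runs (model, task, prompt, output) VALUES (?, ?, ?, ?)",
            (model, task, prompt, output),
        )
```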