VibeBench 🚀

VibeBench is a research-focused benchmarking framework designed to evaluate and compare AI-based coding assistants across standardized programming tasks. It provides a comprehensive dashboard to generate code, execute it in a controlled environment, and analyze it for correctness, security, and readability.


🌟 Core Features

  • Ollama Cloud Integration: Powered by the latest high-parameter coding models (e.g., Qwen 3 Coder 480B) via the Ollama Cloud API.
  • Split-Pane Interface: A professional dashboard featuring a top configuration bar, side-by-side code/metrics view, and a comparison panel for Execution vs. Expected output.
  • Industry-Standard Metrics: Normalized 0–10 scores based on real static analysis tools (see the scanner sketch after this list):
    • Security: Bandit (Python), Semgrep (JS/PHP), and ShellCheck (Bash).
    • Readability: Radon (Python) and Lizard (Multi-language) for Cyclomatic Complexity.
    • Correctness: Functional verification against task-specific Ground Truths.
  • Task Management: 8 specialized tasks (A–H) covering file I/O, multi-threading, databases, and authentication.
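For context on how the metric scores are produced, here is a minimal sketch of driving one of the scanners: it shells out to Bandit with JSON output and tallies findings by severity. The target filename is illustrative; the actual scoring pipeline lives in backend/scanners/.

import json
import subprocess

# Ask Bandit for machine-readable output on a generated solution.
result = subprocess.run(
    ["bandit", "-f", "json", "generated_solution.py"],  # illustrative path
    capture_output=True,
    text=True,
)
findings = json.loads(result.stdout)["results"]

# Bandit reports issue_severity as LOW, MEDIUM, or HIGH.
by_severity = {"HIGH": 0, "MEDIUM": 0, "LOW": 0}
for issue in findings:
    by_severity[issue["issue_severity"]] += 1
print(by_severity)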

🛠️ Requirements

  • Node.js 18+ & npm
  • Python 3.10+ & pip
  • Ollama Cloud API Key: Required for AI code generation.
  • Static Analysis Tools:
    # Python tools
    pip install bandit semgrep radon lizard
    # Shell tools (macOS; use your distro's package manager on Linux)
    brew install shellcheck
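Before benchmarking, a quick sanity check that all five tools are on PATH (a convenience sketch, not part of the repo):

import shutil

for tool in ["bandit", "semgrep", "radon", "lizard", "shellcheck"]:
    status = "ok" if shutil.which(tool) else "MISSING"
    print(f"{tool}: {status}")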

🚀 Quick Start

1. Backend Setup (FastAPI)

cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure Environment
cp .env.example .env
# Edit .env and add your OLLAMA_API_KEY

Start the API:

uvicorn api:app --reload --host 0.0.0.0 --port 8000
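To confirm the backend is up, you can fetch the OpenAPI schema that FastAPI serves out of the box (this assumes the app does not override FastAPI's default docs routes):

import urllib.request

# FastAPI exposes its schema at /openapi.json by default.
with urllib.request.urlopen("http://localhost:8000/openapi.json") as resp:
    print(resp.status)  # 200 means the API is ready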

2. Frontend Setup (Next.js)

cd frontend
npm install
npm run dev

Open http://localhost:3000 to start benchmarking.


📊 Metrics & Scoring

VibeBench uses a normalized 0–10 scale to provide a clear comparison between models:

Metric       | Max Score | Logic
Correctness  | 10        | 10 for success, 5 for partial completion, 0 for failure.
Security     | 10        | Starts at 10; penalties for High (-5), Medium (-2), or Low (-1) issues.
Readability  | 10        | Combines Cyclomatic Complexity (Lizard/Radon) and Comment Density.
Quality Gate | Pass/Fail | A final check: requires Status: Success AND no major static quality violations.
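As an illustration of the Security row, the penalty arithmetic reduces to a few lines. Flooring the score at zero is an assumption here; the authoritative logic lives in the backend:

# Penalties from the table above: High -5, Medium -2, Low -1.
PENALTIES = {"HIGH": 5, "MEDIUM": 2, "LOW": 1}

def security_score(severities: list[str]) -> int:
    """Start at 10, subtract a penalty per finding, floor at 0 (assumed)."""
    return max(10 - sum(PENALTIES[s] for s in severities), 0)

print(security_score(["HIGH", "LOW"]))  # 10 - 5 - 1 = 4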

📂 Benchmark Tasks (A–H)

Task | Category        | Objective
A    | File I/O        | Parse complex CSV data and calculate aggregates.
B    | Concurrency     | Multi-threaded JSON processing with data integrity tokens.
C    | File Writing    | Generate a deterministic report file from JSON source.
D    | Data Processing | Multi-threaded JSON transformation and checksum calculation.
E    | Archiving       | Secure ZIP creation with manifest validation.
F    | Databases       | MySQL relational query service with filtering/sorting.
G    | Databases       | MongoDB document-query feature with projections.
H    | Security        | Modern authentication utility using PBKDF2-HMAC-SHA256.
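For reference, Task H exercises the technique available in Python's standard library as hashlib.pbkdf2_hmac. The sketch below shows the general pattern; the iteration count and salt size are illustrative choices, not the task's required values:

import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 600_000) -> tuple[bytes, bytes]:
    """Derive a PBKDF2-HMAC-SHA256 digest with a random per-password salt."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes,
                    iterations: int = 600_000) -> bool:
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)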

🏗️ Repository Layout

  • frontend/: Next.js UI built with Tailwind CSS.
  • backend/: FastAPI orchestrator and execution sandbox.
  • backend/scanners/: Integration logic for Bandit, Semgrep, and Lizard.
  • backend/tasks/evaluators.py: The "Source of Truth" for task verification logic.
  • test_data/: Canonical fixtures used as inputs and expected outputs.
  • docs/: Detailed research requirements and architecture documentation.
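The evaluator contract amounts to comparing generated output against a ground truth and mapping the result onto the 10/5/0 correctness scale. A hypothetical sketch of that shape (the function name and matching rules are invented for illustration; backend/tasks/evaluators.py holds the real logic):

def evaluate(actual: str, expected: str) -> int:
    """Map an output comparison onto the correctness scale from the table."""
    if actual.strip() == expected.strip():
        return 10  # success
    if expected.strip() and expected.strip() in actual:
        return 5   # partial completion (hypothetical rule)
    return 0       # failure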

📝 Notes

  • Sandbox Security: Code is executed using local runtimes. Ensure your environment has the necessary interpreters (Python, Node, PHP, Bash) installed.
  • Research Focus: This project is designed for academic comparison; all prompts and outputs are logged in experiments.db (SQLite) for meta-analysis.
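Because the log is plain SQLite, you can inspect it without knowing the schema in advance; listing tables via sqlite_master is standard sqlite3 usage:

import sqlite3

con = sqlite3.connect("experiments.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # whatever tables the backend created
con.close()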
