SusVibes is a benchmark and evaluation pipeline designed to expose security vulnerabilities in code generated by AI agents solving real-world software engineering tasks. The framework provides a standardized pipeline for assessing both the quality and the security implications of agent-generated code across diverse software engineering domains.
The benchmark consists of 200 realistic coding tasks constructed from 108 existing open-source projects. Agent solutions are evaluated for functional correctness and security using dynamic tests in execution environments. SusVibes covers 77 weakness types from the Common Weakness Enumeration (CWE). On average, each task's codebase contains 160k lines of code across 867 files, and a solution requires over 170 lines of code, providing a realistic evaluation setting for coding agents.
- Clone the repository:

  ```bash
  git clone https://github.com/LeiLiLab/susvibes-project.git
  cd susvibes-project
  ```

- Install Python dependencies:

  ```bash
  conda create -n sv python=3.11
  conda activate sv
  pip install -r requirements.txt
  ```

- The SusVibes dataset is placed under the `datasets/` directory for convenient usage:

  ```
  datasets/
  └── susvibes_dataset.jsonl
  ```

The SusVibes dataset contains task information with the following key fields:

- `instance_id`: Unique identifier for each task from a real-world project, formatted as `repo-owner__repo-name_commit-id`
- `image_name`: Pre-built Docker image containing the development environment of each task
- `problem_statement`: Natural language description of the task for agent input
- Other metadata and evaluation specifications are omitted here.
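As a quick sanity check, the dataset can be inspected with a few lines of Python. This is a minimal sketch that assumes only the dataset path and the field names listed above:

```python
import json

# Load the SusVibes task records from the JSONL dataset (path relative to the repo root).
with open("datasets/susvibes_dataset.jsonl") as f:
    tasks = [json.loads(line) for line in f]

print(f"Loaded {len(tasks)} tasks")
first = tasks[0]
# Inspect the key fields described above for the first task.
print(first["instance_id"])              # e.g. repo-owner__repo-name_commit-id
print(first["image_name"])               # Docker image with the task's dev environment
print(first["problem_statement"][:200])  # truncated natural-language task description
```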
- Prepare the environment:
  - Pull the Docker image specified in the `image_name` field:

    ```bash
    docker pull [image_name]
    ```

  - The project code that the task operates on is located at `/project` within each Docker container.
- Execute your agent:
  - Feed the `problem_statement` to your agent.
  - Let the agent generate code solutions within the containerized environment.
- Format predictions: Save your agent's outputs in JSONL format with the following structure (see the sketch after this list):

  ```json
  {
    "instance_id": "repo-owner__repo-name_commit-id",
    "model_name_or_path": "your-model-name",
    "model_patch": "the-implementation-patch"
  }
  ```
For an example walkthrough of how to run Kimi CLI on SusVibes, see the tutorial.
Note: SusVibes evaluation can be resource intensive. For an accurate evaluation, we recommend at least 400GB of free storage, along with 4GB of RAM and 4 CPU cores per parallel worker, on an `x86_64` machine.
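Before launching a run, it can help to confirm the machine meets these requirements. The commands below are standard Linux/Docker utilities, not part of SusVibes itself:

```bash
uname -m       # should report x86_64
df -h .        # free disk space for Docker images and logs
free -h        # available RAM (about 4GB per parallel worker)
nproc          # available CPU cores (about 4 per parallel worker)
docker info    # confirms the Docker daemon is reachable
```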
Run the evaluation pipeline from the `src/` directory:

```bash
cd src/
python -m run_evaluation \
    --run_id="unique-run-identifier" \
    --predictions_path="path/to/agent/predictions.jsonl" \
    --max_workers="5" \
    --summary_path="path/to/output/summary.json" \
    [--force]  # Optional: force re-evaluation
```

- `--run_id`: Name of this evaluation run
- `--predictions_path`: Path to your agent's predictions file
- `--max_workers`: Number of parallel workers (adjust based on available CPU cores)
- `--summary_path`: Output path for evaluation results
- `--force`: Force re-evaluation even if previous logs exist
See the subfolder's README for more details.
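A quick pre-flight check of the predictions file can save a long run from failing on malformed records. This sketch uses only the documented fields and dataset path; the predictions path is a placeholder, and paths are relative to the repository root:

```python
import json

REQUIRED_FIELDS = {"instance_id", "model_name_or_path", "model_patch"}

# Instance IDs defined by the benchmark dataset.
with open("datasets/susvibes_dataset.jsonl") as f:
    known_ids = {json.loads(line)["instance_id"] for line in f}

# Check every prediction record before launching run_evaluation.
with open("path/to/agent/predictions.jsonl") as f:
    for lineno, line in enumerate(f, start=1):
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            print(f"line {lineno}: missing fields {sorted(missing)}")
        elif record["instance_id"] not in known_ids:
            print(f"line {lineno}: unknown instance_id {record['instance_id']!r}")
```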
SusVibes supports evaluating agents with security-enhanced designs through several prompting strategies. This feature lets you prepare datasets that incorporate specific safeguard guidance and evaluate how agents perform under these enhanced conditions.
First, enhance the dataset with your chosen safety strategy. This will generate a new dataset file `susvibes_dataset_{safety_strategy}.jsonl` in the `datasets/` directory.

```bash
cd src/
python -m run_evaluation \
    --prepare \
    --safety_strategy="your-safety-strategy"
```

| Strategy | Description |
|---|---|
| `generic` | Provides general security guidelines to the agent without specific vulnerability information |
| `self-selection` | Allows the agent to select relevant security concerns from a provided list of all possible CWE (Common Weakness Enumeration) types in the dataset |
| `oracle` | Provides the agent with the exact CWE vulnerabilities that are relevant to the specific task |
| `feedback-driven` | Iteratively improves the implementation based on feedback from executing security tests. This mode requires `--feedback_tool` and agent-level integration. |
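For example, to prepare the oracle-guided variant (the strategy name comes from the table above; the output filename follows the pattern stated earlier):

```bash
cd src/
python -m run_evaluation --prepare --safety_strategy="oracle"
ls ../datasets/susvibes_dataset_oracle.jsonl   # new dataset generated by the prepare step
```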
After preparing the security-enhanced dataset, run your agent against it instead of the default dataset. Then evaluate with the same command as before, plus the safety-strategy option:

```bash
cd src/
python -m run_evaluation \
    --safety_strategy="your-safety-strategy"
    # ... other parameters
```

The evaluation then lets you assess how the agent performs with the additional security measures in place.
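For instance, a baseline run and an oracle-strategy run can be kept separate with distinct `--run_id` and `--summary_path` values. All flags below are the documented ones; the concrete values are placeholders:

```bash
cd src/

# Baseline evaluation of the agent's predictions.
python -m run_evaluation \
    --run_id="baseline-run" \
    --predictions_path="path/to/baseline_predictions.jsonl" \
    --max_workers="5" \
    --summary_path="path/to/baseline_summary.json"

# Evaluation of predictions produced under the oracle safety strategy.
python -m run_evaluation \
    --run_id="oracle-run" \
    --predictions_path="path/to/oracle_predictions.jsonl" \
    --max_workers="5" \
    --summary_path="path/to/oracle_summary.json" \
    --safety_strategy="oracle"
```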
```
susvibes-project/
├── src/
│   ├── curate/             # Task curation
│   ├── env_specs/          # Environment specifications
│   ├── safety_strategies/  # Security-enhanced strategies (Advanced Usage)
│   ├── env.py              # Environment management
│   ├── tasks.py            # Task definitions
│   └── run_evaluation.py   # Main evaluation entry
├── datasets/               # Dataset storage
├── logs/                   # Evaluation logs
├── projects/               # Task repositories
└── requirements.txt        # Dependencies
```
- Docker permission errors: Ensure your user has proper Docker permissions (see the snippet after this list)
- Memory issues: Reduce `--max_workers` if encountering OOM errors
- Storage space: Ensure sufficient disk space for Docker images and logs
- Computation power: Allocate enough CPU cores to avoid unexpected evaluation timeouts
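On most Linux setups, Docker permission errors come from the daemon socket being root-owned; a standard remedy (general Docker advice, not specific to SusVibes) is:

```bash
# Add your user to the docker group, then start a new login session.
sudo usermod -aG docker "$USER"
newgrp docker                  # or log out and back in
docker run --rm hello-world    # verify non-root access works
```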
We welcome contributions to improve SusVibes! Please see our contributing guidelines for:
- Adding new tasks
- Improving evaluation metrics
- Enhancing security analysis capabilities
- Documentation improvements
This project is licensed under the terms specified in the LICENSE file.
For questions, issues, or collaboration opportunities, please:
- Open an issue on GitHub
- Contact the maintainers at songwenz@andrew.cmu.edu
We thank the open-source community for providing the diverse codebases used in our benchmark tasks.