SusVibes is a benchmark and evaluation pipeline designed to expose security vulnerabilities in code generated by AI agents solving real-world software engineering tasks. The framework provides a standardized pipeline for assessing both the quality and the security implications of agent-generated code across diverse software engineering domains.
The benchmark consists of 200 realistic coding tasks constructed from 108 existing open-source projects. Agent solutions are evaluated for functional correctness and security using dynamic tests in execution environments. SusVibes covers 77 weakness types from the Common Weakness Enumeration (CWE). On average, each task's codebase contains 160k lines of code across 867 files, and a solution requires over 170 lines of code, providing a realistic evaluation setting for coding agents.
- Clone the repository:

  ```bash
  git clone https://github.com/LeiLiLab/susvibes-project.git
  cd susvibes-project
  ```

- Install Python dependencies:

  ```bash
  conda create -n sv python=3.11
  conda activate sv
  pip install -r requirements.txt
  ```

- The SusVibes dataset is placed under the `datasets/` directory for convenient usage:

  ```
  datasets/
  └── susvibes_dataset.jsonl
  ```

The SusVibes dataset contains task information with the following key fields:

- `instance_id`: Unique identifier for each task from a real-world project, formatted as `repo-owner__repo-name_commit-id`
- `image_name`: Pre-built Docker image containing the development environment of each task
- `problem_statement`: Natural language description of the task for agent input
- Other metadata and evaluation specifications are omitted here.
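As a quick sanity check, the dataset can be inspected with a few lines of Python. This is a minimal sketch that assumes only the dataset path and the field names listed above:

```python
import json

# Load the SusVibes task records from the JSONL dataset (path relative to the repo root).
with open("datasets/susvibes_dataset.jsonl") as f:
    tasks = [json.loads(line) for line in f]

print(f"Loaded {len(tasks)} tasks")
first = tasks[0]
# Inspect the key fields described above for the first task.
print(first["instance_id"])              # e.g. repo-owner__repo-name_commit-id
print(first["image_name"])               # Docker image with the task's dev environment
print(first["problem_statement"][:200])  # truncated natural-language task description
```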
- Prepare the environment:
  - Pull the Docker image specified in the `image_name` field:

    ```bash
    docker pull [image_name]
    ```

  - The project code that the task operates on is located at `/project` within each Docker container.
- Execute your agent:
  - Feed the `problem_statement` to your agent.
  - Let the agent generate code solutions within the containerized environment.
- Format predictions: Save your agent's outputs in JSONL format with the following structure (see the sketch after this list):

  ```json
  {
    "instance_id": "repo-owner__repo-name_commit-id",
    "model_name_or_path": "your-model-name",
    "model_patch": "the-implementation-patch"
  }
  ```
For an example walkthrough of how to run Kimi CLI on SusVibes, see the tutorial.
Note: SusVibes evaluation can be resource intensive. For an accurate evaluation, we recommend at least 400GB of free storage, along with 4GB of RAM and 4 CPU cores per parallel worker, on an `x86_64` machine.
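Before launching a run, it can help to confirm the machine meets these requirements. The commands below are standard Linux/Docker utilities, not part of SusVibes itself:

```bash
uname -m       # should report x86_64
df -h .        # free disk space for Docker images and logs
free -h        # available RAM (about 4GB per parallel worker)
nproc          # available CPU cores (about 4 per parallel worker)
docker info    # confirms the Docker daemon is reachable
```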
Run the evaluation pipeline from the `src/` directory:

```bash
cd src/
python -m run_evaluation \
    --run_id="unique-run-identifier" \
    --predictions_path="path/to/agent/predictions.jsonl" \
    --max_workers="5" \
    --summary_path="path/to/output/summary.json" \
    [--force]  # Optional: force re-evaluation
```

- `--run_id`: Name of this evaluation run
- `--predictions_path`: Path to your agent's predictions file
- `--max_workers`: Number of parallel workers (adjust based on available CPU cores)
- `--summary_path`: Output path for evaluation results
- `--force`: Force re-evaluation even if previous logs exist
See the subfolder's README for more details.
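A quick pre-flight check of the predictions file can save a long run from failing on malformed records. This sketch uses only the documented fields and dataset path; the predictions path is a placeholder, and paths are relative to the repository root:

```python
import json

REQUIRED_FIELDS = {"instance_id", "model_name_or_path", "model_patch"}

# Instance IDs defined by the benchmark dataset.
with open("datasets/susvibes_dataset.jsonl") as f:
    known_ids = {json.loads(line)["instance_id"] for line in f}

# Check every prediction record before launching run_evaluation.
with open("path/to/agent/predictions.jsonl") as f:
    for lineno, line in enumerate(f, start=1):
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            print(f"line {lineno}: missing fields {sorted(missing)}")
        elif record["instance_id"] not in known_ids:
            print(f"line {lineno}: unknown instance_id {record['instance_id']!r}")
```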
SusVibes supports evaluating agents with security-enhanced designs through several prompting strategies. This feature lets you prepare datasets that incorporate specific safeguard guidance and evaluate how agents perform under these enhanced conditions.
First, enhance the dataset with your chosen safety strategy. This will generate a new dataset file `susvibes_dataset_{safety_strategy}.jsonl` in the `datasets/` directory.

```bash
cd src/
python -m run_evaluation \
    --prepare \
    --safety_strategy="your-safety-strategy"
```

| Strategy | Description |
|---|---|
| `generic` | Provides general security guidelines to the agent without specific vulnerability information |
| `self-selection` | Allows the agent to select relevant security concerns from a provided list of all possible CWE (Common Weakness Enumeration) types in the dataset |
| `oracle` | Provides the agent with the exact CWE vulnerabilities that are relevant to the specific task |
| `feedback-driven` | Iteratively improves the implementation based on feedback from executing security tests. This mode requires `--feedback_tool` and agent-level integration. |
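For example, to prepare the oracle-guided variant (the strategy name comes from the table above; the output filename follows the pattern stated earlier):

```bash
cd src/
python -m run_evaluation --prepare --safety_strategy="oracle"
ls ../datasets/susvibes_dataset_oracle.jsonl   # new dataset generated by the prepare step
```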
After preparing the security-enhanced dataset, run your agent against it instead of the default dataset. Then evaluate with the same command as before, plus the safety-strategy option:

```bash
cd src/
python -m run_evaluation \
    --safety_strategy="your-safety-strategy"
    # ... other parameters
```

The evaluation then lets you assess how the agent performs with the additional security measures in place.
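For instance, a baseline run and an oracle-strategy run can be kept separate with distinct `--run_id` and `--summary_path` values. All flags below are the documented ones; the concrete values are placeholders:

```bash
cd src/

# Baseline evaluation of the agent's predictions.
python -m run_evaluation \
    --run_id="baseline-run" \
    --predictions_path="path/to/baseline_predictions.jsonl" \
    --max_workers="5" \
    --summary_path="path/to/baseline_summary.json"

# Evaluation of predictions produced under the oracle safety strategy.
python -m run_evaluation \
    --run_id="oracle-run" \
    --predictions_path="path/to/oracle_predictions.jsonl" \
    --max_workers="5" \
    --summary_path="path/to/oracle_summary.json" \
    --safety_strategy="oracle"
```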
```
susvibes-project/
├── src/
│   ├── curate/             # Task curation
│   ├── env_specs/          # Environment specifications
│   ├── safety_strategies/  # Security-enhanced strategies (Advanced Usage)
│   ├── env.py              # Environment management
│   ├── tasks.py            # Task definitions
│   └── run_evaluation.py   # Main evaluation entry
├── datasets/               # Dataset storage
├── logs/                   # Evaluation logs
├── projects/               # Task repositories
└── requirements.txt        # Dependencies
```
- Docker permission errors: Ensure your user has proper Docker permissions (see the snippet after this list)
- Memory issues: Reduce `--max_workers` if encountering OOM errors
- Storage space: Ensure sufficient disk space for Docker images and logs
- Computation power: Allocate enough CPU cores to avoid unexpected evaluation timeouts
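On most Linux setups, Docker permission errors come from the daemon socket being root-owned; a standard remedy (general Docker advice, not specific to SusVibes) is:

```bash
# Add your user to the docker group, then start a new login session.
sudo usermod -aG docker "$USER"
newgrp docker                  # or log out and back in
docker run --rm hello-world    # verify non-root access works
```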
We welcome contributions to improve SusVibes! Please see our contributing guidelines for:
- Adding new tasks
- Improving evaluation metrics
- Enhancing security analysis capabilities
- Documentation improvements
This project is licensed under the terms specified in the LICENSE file.
For questions, issues, or collaboration opportunities, please:
- Open an issue on GitHub
- Contact the maintainers at songwenz@andrew.cmu.edu
We thank the open-source community for providing the diverse codebases used in our benchmark tasks.