SusVibes: Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

Requirements: Python 3.11+, Docker 27.2.1+

Overview

SusVibes is a comprehensive benchmark and evaluation pipeline designed to expose security vulnerabilities in code generated by AI agents solving real-world software engineering tasks. The framework provides a standardized pipeline for assessing both the functional quality and the security implications of agent-generated code across diverse software engineering domains.

The benchmark consists of 200 realistic coding tasks constructed from 108 existing open-source software projects. Agent solutions are evaluated for both functional correctness and security using dynamic tests in execution environments. SusVibes covers 77 weakness types from the Common Weakness Enumeration (CWE). On average, each task's codebase contains 160k lines of code across 867 files, and a solution requires over 170 lines of code, providing a realistic evaluation setting for coding agents.

Installation

  1. Clone the repository:
git clone https://github.com/LeiLiLab/susvibes-project.git
cd susvibes-project
  2. Install Python dependencies:
conda create -n sv python=3.11
conda activate sv
pip install -r requirements.txt
  3. The SusVibes dataset is included under the datasets/ directory for convenient use:
datasets/
└── susvibes_dataset.jsonl

Overview of SusVibes' Dataset

The SusVibes dataset contains task information with the following key fields:

  • instance_id: Unique identifier for each task from real-world projects, formatted as repo-owner__repo-name_commit-id
  • image_name: Pre-built Docker image containing the development environment of each task
  • problem_statement: Natural language description of the task for agent input
  • Other metadata and evaluation specifications are omitted here.
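As a minimal illustration (assuming the standard JSONL layout, one task object per line, with the fields listed above), the dataset can be loaded and inspected like this:

import json

# Load all SusVibes tasks from the JSONL dataset (one JSON object per line).
with open("datasets/susvibes_dataset.jsonl", "r", encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

# Inspect the key fields of the first task.
task = tasks[0]
print(task["instance_id"])              # e.g. repo-owner__repo-name_commit-id
print(task["image_name"])               # pre-built Docker image for this task
print(task["problem_statement"][:200])  # natural-language task description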

Evaluation Guidelines

Step 1: Harness Your Coding Agent on SusVibes Tasks

  1. Prepare the environment:

    • Pull Docker images specified in the image_name field:
    docker pull [image_name]
    • The project code that the task operates on is located at /project inside each Docker container
  2. Execute your agent:

    • Feed the problem_statement to your agent
    • Let the agent generate code solutions within the containerized environment
  3. Format predictions: Save your agent's outputs in JSONL format with the following structure:

    {
      "instance_id": "repo-owner__repo-name_commit-id",
      "model_name_or_path": "your-model-name",
      "model_patch": "the-implementation-patch"
    }

For an example walkthrough of running Kimi CLI on SusVibes, see the tutorial.
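The loop below is a minimal sketch of this workflow, not the official harness: run_agent is a hypothetical placeholder for however you drive your agent inside the task container (where the project code lives at /project); only the docker pull step, the dataset fields, and the prediction format come from the steps above.

import json
import subprocess

def run_agent(problem_statement: str, image_name: str) -> str:
    # Hypothetical placeholder: launch your agent inside the task container
    # (project code located at /project) and return its patch as a string.
    raise NotImplementedError

with open("datasets/susvibes_dataset.jsonl", "r", encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

with open("predictions.jsonl", "w", encoding="utf-8") as out:
    for task in tasks:
        # Pull the pre-built environment for this task.
        subprocess.run(["docker", "pull", task["image_name"]], check=True)
        patch = run_agent(task["problem_statement"], task["image_name"])
        out.write(json.dumps({
            "instance_id": task["instance_id"],
            "model_name_or_path": "your-model-name",
            "model_patch": patch,
        }) + "\n")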

Step 2: Evaluation

Note: SusVibes evaluation can be resource intensive. For an accurate evaluation, we recommend at least 400GB of free storage, plus 4GB of RAM and 4 CPU cores per parallel worker, on an x86_64 machine.

Run the evaluation pipeline from the src/ directory:

cd src/
python -m run_evaluation \
  --run_id="unique-run-identifier" \
  --predictions_path="path/to/agent/predictions.jsonl" \
  --max_workers="5" \
  --summary_path="path/to/output/summary.json" \
  [--force]  # Optional: force re-evaluation

Parameters:

  • --run_id: Name of this evaluation run
  • --predictions_path: Path to your agent's predictions file
  • --max_workers: Number of parallel workers (adjust based on available CPU cores)
  • --summary_path: Output path for evaluation results
  • --force: Force re-evaluation even if previous logs exist
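After a run finishes, the results are written to the file given by --summary_path. The exact schema of summary.json is not documented here, so this sketch simply loads and pretty-prints whatever the pipeline wrote:

import json

# Load and pretty-print the evaluation summary written to --summary_path.
with open("path/to/output/summary.json", "r", encoding="utf-8") as f:
    summary = json.load(f)
print(json.dumps(summary, indent=2))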

Advanced Usage

Task Curation Pipeline

See the README in the corresponding subfolder (src/curate/) for more details.

Security-Enhanced Evaluation

SusVibes supports evaluating agents under security-enhanced settings via different prompting strategies. This feature lets you prepare datasets that incorporate specific safeguard guidance and evaluate how agents perform under these enhanced conditions.

1. Prepare Enhanced Dataset

First, enhance the dataset with your chosen safety strategy. This will generate a new dataset file susvibes_dataset_{safety_strategy}.jsonl in the datasets/ directory.

cd src/
python -m run_evaluation \
  --prepare \
  --safety_strategy="your-safety-strategy"
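A quick sanity check (a minimal sketch run from the repository root; the file name follows the naming convention stated above) that the enhanced dataset was actually generated:

from pathlib import Path

safety_strategy = "your-safety-strategy"  # same value passed to --safety_strategy
enhanced_path = Path("datasets") / f"susvibes_dataset_{safety_strategy}.jsonl"
assert enhanced_path.exists(), f"Enhanced dataset not found: {enhanced_path}"
print(f"Prepared dataset: {enhanced_path}")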

Available Safety Strategies

  • generic: Provides general security guidelines to the agent without specific vulnerability information
  • self-selection: Allows the agent to select relevant security concerns from a provided list of all possible CWE (Common Weakness Enumeration) types in the dataset
  • oracle: Provides the agent with the exact CWE vulnerabilities that are relevant to the specific task
  • feedback-driven: Iteratively improves the implementation based on feedback from executing security tests. This mode requires --feedback_tool and agent-level integration.

2. Run Agent with Enhanced Dataset

After preparing the security-enhanced dataset, use it to harness your agent instead of the original dataset. Then evaluate with the same command as before, adding the safety strategy option:

cd src/
python -m run_evaluation \
  --safety_strategy="your-safety-strategy" 
  # ... other parameters

The evaluation then lets you assess how the agent performs with these additional security measures in place.

Project Structure

susvibes-project/
├── src/
│   ├── curate/            # Task curation
│   ├── env_specs/         # Environment specifications
│   ├── safety_strategies/ # Security-enhanced strategies (Advanced Usage)
│   ├── env.py             # Environment management
│   ├── tasks.py           # Task definitions
│   └── run_evaluation.py  # Main evaluation entry point
├── datasets/              # Dataset storage
├── logs/                  # Evaluation logs
├── projects/              # Task repositories
└── requirements.txt       # Dependencies

Troubleshooting

Common Issues

  1. Docker permission errors: Ensure your user has proper Docker permissions
  2. Memory issues: Reduce --max_workers if encountering OOM errors
  3. Storage space: Ensure sufficient disk space for Docker images and logs
  4. CPU resources: Allocate enough CPU cores to avoid unexpected evaluation timeouts

Contributing

We welcome contributions to improve SusVibes! Please see our contributing guidelines for:

  • Adding new tasks
  • Improving evaluation metrics
  • Enhancing security analysis capabilities
  • Documentation improvements

License

This project is licensed under the terms specified in the LICENSE file.

Contact

For questions, issues, or collaboration opportunities, please open an issue on the repository.

Acknowledgments

We thank the open-source community for providing the diverse codebases used in our benchmark tasks.
