In cybersecurity, proactive strategies aimed at identifying vulnerabilities before they can be exploited play a major role. Among these strategies is penetration testing which, despite its importance, remains a complex process that requires strong technical expertise and a significant time investment. One of the most critical phases is exploitation, during which a Proof of Concept (PoC) is developed to concretely verify whether an identified vulnerability is actually exploitable.
This work presents CVExploit, a semi-automated multi-agent framework designed to automate this phase through the use of a Large Language Model (LLM). A pipeline was designed to automatically generate an exploit starting from the information associated with a vulnerability's CVE and the details provided about the target system. The architecture integrates problem decomposition, validation, and code refinement mechanisms to improve the robustness and reliability of the generation process.
The framework was evaluated on a set of 32 CVEs, achieving an overall success rate of 65.6%. The results show that integrating an LLM into a carefully designed architecture can provide concrete support for exploitation activities in penetration testing.
The source code for the CVExploit framework and the results collected during the evaluation are available in this GitHub repository.
├── README.md
├── .env.example # Example template for the .env file
├── config_info.yaml # File containing the framework configuration parameters
├── Ablation_Study/ # Implementation of the different pipelines tested in the ablation study
├── CVE/ # Vulnerable environments for the tested CVEs
├── Executors/ # Isolated environment for exploit execution
│ └── Python-Docker-Executor/
├── Pipeline/ # Implementation of the framework's main pipeline
├── Results/ # Results obtained while testing the framework
└── Setup/ # Scripts and files for setting up the working environment
Before getting started, make sure you have a Windows or Linux machine configured with the following software:
- WSL (Windows Subsystem for Linux): Mandatory on Windows to use `searchsploit` and support `Docker`.
- Docker: Required to manage both the vulnerable environment and the attack environment.
  - Note for Linux users: Docker must be configured to run without root privileges (`sudo`). Run the command below and restart your session or machine: `sudo usermod -aG docker $USER`
- Ollama: Engine used for the embedding model and local models.
  - Note: If the service is hosted on a remote machine, see section 1 `API Keys and Services` for `OLLAMA_HOST`. In that case, a local installation is not required.
- Python and PIP: Version 3.11 or later is recommended.
Once the prerequisites are satisfied, proceed with the working environment setup.
- Note for Linux users: From the project root, you must assign execution permissions to the scripts:
sudo chmod -R 755 .
- Rename `.env.example` to `.env` and open it with a text editor. Its content should be:
GOOGLE_API_KEY=insert_key_here
GROQ_API_KEY=insert_key_here
OLLAMA_HOST=http://localhost:11434
GITHUB_TOKEN=insert_token_here
LANGSMITH_API_KEY=insert_key_here
- Configure the API keys and services according to these rules:
| API Key / Service | Rule |
|---|---|
| `GOOGLE_API_KEY` | Required if you use GOOGLE as the LLM provider |
| `GROQ_API_KEY` | Required if you use GROQ as the LLM provider |
| `OLLAMA_HOST` | Always required. HTTP endpoint of the Ollama service. Change it only if Ollama is not running locally, specifying the IP address or hostname and the port on which the service is listening |
| `GITHUB_TOKEN` | Always required |
| `LANGSMITH_API_KEY` | Required only if you want to enable tracing through LangSmith |
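The rules in the table can be checked before launching the framework. The following is a minimal sketch, not part of CVExploit: the function name `missing_settings` and the placeholder detection are assumptions made for illustration, and the sketch assumes the `.env` values have already been loaded into the process environment (or are passed in as a dictionary).

```python
import os

# Rules taken from the table above. Names below are illustrative only.
ALWAYS_REQUIRED = ("OLLAMA_HOST", "GITHUB_TOKEN")
PROVIDER_KEYS = {"google": "GOOGLE_API_KEY", "groq": "GROQ_API_KEY"}
# Values shipped in .env.example that mean "not configured yet".
PLACEHOLDERS = {"", "insert_key_here", "insert_token_here"}


def missing_settings(provider, env=None):
    """Return required variable names that are unset or still placeholders."""
    env = os.environ if env is None else env
    required = list(ALWAYS_REQUIRED)
    # Only the google and groq providers need an API key; ollama does not.
    key = PROVIDER_KEYS.get(provider.lower())
    if key is not None:
        required.append(key)
    return [name for name in required if env.get(name, "").strip() in PLACEHOLDERS]
```

Calling `missing_settings("google")` returns an empty list when everything needed for the Google provider is set. `LANGSMITH_API_KEY` is deliberately left out because it is optional.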
ℹ️ How to obtain the API keys and tokens

1. GOOGLE_API_KEY
   - Go to Google AI Studio at Get API key.
   - Click `Create API Key`.
   - Fill in the requested information, click `Create key`, and copy the generated string.
2. GROQ_API_KEY
   - Go to GROQ at API Keys.
   - Click `Create API Key`.
   - Fill in the requested information, click `Submit`, and copy the generated string.
3. GITHUB_TOKEN
   - Go to GitHub at Developer Settings > Personal access tokens.
   - Click `Generate new token (classic)`.
   - Assign a name, for example `GIT-REPO-token`, set an `Expiration Date`, and select `public_repo` permissions.
   - Click `Generate token` and copy the generated string.
4. LANGSMITH_API_KEY
   - Go to LangSmith at Settings > API Keys.
   - Click `+ API Key` in the top-right corner.
   - Fill in the `description` field, select Personal Access Token as the `key type`, configure the `workspace`, and choose an `Expiration Date`.
   - Click `Create API Key` and copy the generated string.
The config_info.yaml file contains the parameters required for the pipeline to operate. You can modify them based on your needs:
| Parameter | Default Value | Description |
|---|---|---|
| **TARGET CVE** | | |
| `CVE_ID` | `CVE-2014-6271` | Identifier of the CVE you want to try to exploit. |
| **PROVIDER AND LLM SETTINGS** | | |
| `PROVIDER` | `google` | LLM provider. Supported options: `google`, `groq`, and `ollama`. |
| `NAME_MODEL` | `gemini-2.5-flash` | Specific model name to use. |
| `TEMPERATURE` | `0.0` | Degree of model creativity. `0.0` enables deterministic and reproducible results. |
| `RETRIES` | `5` | Number of model retries in case of errors. |
| `RATE_LIMITER_REQUESTS_PER_MINUTE` | `1000` | Requests-per-minute limit to avoid rate-limiting errors from Google and Groq providers. |
| `INPUT_TOKEN_PER_MILLION` | `0.30` | Cost in USD for one million input tokens for the selected model. |
| `OUTPUT_TOKEN_PER_MILLION` | `2.50` | Cost in USD for one million output tokens for the selected model. |
| `MAX_TOKENS` | `1000000` | Maximum number of input tokens the model can process. |
| `SAFETY_MARGIN` | `1000` | Safety margin to avoid exceeding the model's maximum input token limit. |
| **DOCUMENT RETRIEVAL AND CHUNKING SETTINGS (RAG)** | | |
| `CHUNK_SIZE` | `1024` | Size of the text chunks into which documents are split. |
| `CHUNK_OVERLAP` | `100` | Number of overlapping characters between adjacent chunks. |
| `NUMBER_CHUNK` | `10` | Maximum number of relevant text chunks that can be retrieved. |
| `EMBEDDING_MODEL` | `embeddinggemma:300m` | Model used through Ollama to generate document embeddings. You can specify a different model that was previously downloaded through Ollama. |
| **EXECUTION AND VALIDATION PARAMETERS** | | |
| `MAX_CVE_URL_DOCUMENT` | `5` | Number of documents related to the target CVE that is considered sufficient to avoid analyzing documents for similar CVEs as well. |
| `MAX_VALIDATION_RETRIES` | `3` | Maximum number of attempts for validation checks. |
| `MAX_REFINEMENT_CYCLES` | `6` | Maximum number of automatic corrections after exploit generation. |
| **TOOLS** | | |
| `EXTERNAL_TOOLS` | `jmet-0.1.0-all.jar`, `ysoserial-all.jar`, `marshalsec-0.0.3-SNAPSHOT-all.jar`, `ColdFusionPwn-0.0.1-SNAPSHOT-all.jar` | Tools that the LLM can use inside the exploit code. To add a tool, insert its name in this list and copy the related file into `Executors/Python-Docker-Executor/TOOL/`. |
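To make the interaction between `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of a fixed-size character splitter with overlap. It is an illustration only: the function `split_text` is hypothetical, and the framework may rely on a different splitting strategy (for example one that respects sentence or paragraph boundaries).

```python
def split_text(text, chunk_size=1024, chunk_overlap=100):
    """Split text into character windows that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # Each new window starts chunk_overlap characters before the previous one ends.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the default values, a 2000-character document yields three chunks, and the last 100 characters of each chunk reappear at the start of the next one. A larger overlap lowers the chance that a relevant passage is cut in half, at the cost of more chunks to embed.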
Warning: if you change the value of NAME_MODEL, you must also update MAX_TOKENS, RATE_LIMITER_REQUESTS_PER_MINUTE, INPUT_TOKEN_PER_MILLION, and OUTPUT_TOKEN_PER_MILLION accordingly.
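The model-dependent parameters are tied together by simple arithmetic, which is why they need to be updated as a group. The sketch below assumes that `SAFETY_MARGIN` is subtracted from `MAX_TOKENS` to obtain the usable input budget and that cost is computed linearly from the per-million prices; both function names are hypothetical and the framework's own accounting may differ.

```python
def input_budget(max_tokens=1_000_000, safety_margin=1_000):
    """Usable input tokens, assuming the margin is subtracted from the limit."""
    return max_tokens - safety_margin


def estimate_cost_usd(input_tokens, output_tokens,
                      input_per_million=0.30, output_per_million=2.50):
    """Estimated cost in USD for one run, given per-million-token prices."""
    total = input_tokens * input_per_million + output_tokens * output_per_million
    return total / 1_000_000
```

For example, with the default prices a run that consumes 200,000 input tokens and 10,000 output tokens would be estimated at about 0.085 USD. If `NAME_MODEL` is changed without updating the prices and limits, both figures become misleading.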
It is good practice to isolate project dependencies. From the project root, run:
python3 -m venv venv
Then activate the virtual environment:
- Windows:
.\venv\Scripts\activate
- Linux:
source venv/bin/activate
Once the virtual environment is active, run the setup script. This command installs the Python libraries, configures searchsploit, and downloads the default embedding model.
python3 Setup/setup.py
Check the README of the selected CVE in the corresponding subdirectory inside `CVE`, especially the Vulnerable Environment Setup and Post-Execution Framework sections, to understand whether any action is required before starting the framework and how to verify whether the exploit succeeded.
To start CVExploit using the pipeline shown in the figure, run:
python3 Pipeline/main.py
If you want to reproduce the ablation-study experiments instead, run:
python3 Ablation_Study/main.py
In that case, when the framework starts you will be asked to select which pipeline to use:
- Pipeline 1: does not include validation checks, code refinement/correction, or the ability for the user to provide input.
- Pipeline 2: includes validation checks, but does not include code refinement/correction or user input.
- Pipeline 3: includes code refinement/correction, but does not include validation checks or user input.
- Pipeline 4: includes both validation checks and code refinement/correction, but does not allow user input.
- Pipeline 5: corresponds to the complete configuration shown in the figure. It integrates validation checks, code refinement/correction, and the ability for the user to provide input.
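The five configurations differ only in which of three features are enabled, so they can be summarized as a small lookup table. This is a sketch for reference, not the framework's code: `PipelineFeatures`, `ABLATION_PIPELINES`, and `describe` are names introduced here for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineFeatures:
    validation: bool   # validation checks
    refinement: bool   # code refinement/correction
    user_input: bool   # ability for the user to provide input


# Feature matrix transcribed from the list above.
ABLATION_PIPELINES = {
    1: PipelineFeatures(validation=False, refinement=False, user_input=False),
    2: PipelineFeatures(validation=True, refinement=False, user_input=False),
    3: PipelineFeatures(validation=False, refinement=True, user_input=False),
    4: PipelineFeatures(validation=True, refinement=True, user_input=False),
    5: PipelineFeatures(validation=True, refinement=True, user_input=True),
}


def describe(number):
    """Return a one-line summary of the features enabled in a pipeline."""
    enabled = [name for name, on in asdict(ABLATION_PIPELINES[number]).items() if on]
    return f"Pipeline {number}: " + (", ".join(enabled) or "baseline generation only")
```

Comparing pipelines 1 through 4 isolates the contribution of validation and refinement, while the difference between pipelines 4 and 5 isolates the effect of user input.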