This repository contains all data and analysis results for evaluating prompting-based and fine-tuning methods for repairing AI-generated code containing security weaknesses (CWEs). The evaluation spans multiple LLMs, programming languages, and CWE-based scenarios.
Click here to access the preprint:
To replicate the study, the following software and resources are required:
- Visual Studio Code (VS Code)
- Note: The automation scripts used in this study were developed and confirmed to be working as of January 2026. Newer VS Code versions may require minor adjustments to the scripts to account for changes in the user interface.
- Access to the following LLMs and their fine-tuning platforms:
- GPT-4.1
- GPT-5
- Gemini 2.0 Flash
- o4-mini
- DeepSeek-R1-32B (local setup)
- CodeQL CLI (2.21.0) and the CodeQL default security query packs
- Python, Java, JavaScript, and Go toolchains installed
Clone the repository and install dependencies:

```shell
git clone https://github.com/AliSoltanianFJ/CodeSecurity2025
cd CodeSecurity2025/
```
The repository is structured as follows:
```
CodeSecurity2025/
├── Scenarios/
│   ├── Go
│   ├── Java
│   ├── JavaScript
│   └── Python
└── Scripts/
    ├── Results
    └── CWEsIntroducedMapping
```
Each programming language folder contains an identical scenario folder structure:
```
<Language>/
└── <Model>/
    └── Scenarios/
        ├── ScenarioX/
        │   ├── CopilotRaw
        │   ├── Idea1
        │   ├── Idea2
        │   ├── Idea3
        │   └── Idea4
```
Where:
| Folder | Meaning |
|---|---|
| CopilotRaw | Baseline model output (no refinement) |
| Idea1 | Negative Example Prompting (NEP) |
| Idea2 | Chain-of-Thought Prompting (CoT) |
| Idea3 | Fine-Tuned model outputs |
| Idea4 | Meta Prompting (MP) |
Each `<Model>` directory contains:
- `results.sarif` → The code scanning results from CodeQL, with all detected CWEs recorded.
Each `ScenarioX` directory contains:
- `prompts.txt` → The baseline and refinement technique prompts for that scenario.
- `results.csv` → A spreadsheet documenting all the CWEs detected in each code sample for that scenario (including results for the original raw samples and each refinement technique).
- A small scenario script (`automateCopilot.py`) used to execute prompts and store generated code.
The `Results` directory includes an overview of the results, and the `CWEsIntroducedMapping` directory includes network diagrams of original CWEs versus any CWEs introduced after applying refinement techniques.
The `Scripts` directory includes the custom Python-based script (`go_custom_code_scanning.py`) used to detect CWEs in Go code that CodeQL fails to detect.
- Open the repository in the terminal.
- For each language, locate the scenarios:
`Scenarios/<Language>/<Model>/Scenarios/ScenarioX/`
Each scenario corresponds to one weakness derived from the MITRE CWE Top 25.
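The scenario layout can be traversed programmatically. The sketch below, assuming the `Scenarios/<Language>/<Model>/Scenarios/ScenarioX` layout described above, enumerates every language × model × scenario combination; the model name and the throwaway directory tree are purely illustrative.

```python
import tempfile
from pathlib import Path

def iter_scenarios(root):
    """Yield (language, model, scenario) triples from the repository layout."""
    for path in sorted(Path(root).glob("Scenarios/*/*/Scenarios/*")):
        language, model, _, scenario = path.parts[-4:]
        yield language, model, scenario

# Build a miniature copy of the layout to demonstrate the traversal.
root = Path(tempfile.mkdtemp())
(root / "Scenarios/Go/GPT-4.1/Scenarios/Scenario1").mkdir(parents=True)
(root / "Scenarios/Go/GPT-4.1/Scenarios/Scenario2").mkdir(parents=True)
print(list(iter_scenarios(root)))
# → [('Go', 'GPT-4.1', 'Scenario1'), ('Go', 'GPT-4.1', 'Scenario2')]
```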
- Ensure the following dependencies are installed if using the `automateCopilot.py` scripts:
- pyperclip==1.8.2
- psutil==6.1.1
- pyautogui==0.9.54
- pywinauto==0.6.9
For each model, language, and scenario:
- Open the `prompts.txt` file in the corresponding `ScenarioX` folder.
- Submit the prompt to the target LLM via Copilot in Visual Studio Code.
- Save the generated code into the `CopilotRaw` folder for that scenario.
This produces the baseline insecure code samples used for RQ1.
Alternatively, the `automateCopilot.py` file in the corresponding `ScenarioX` folder can be used to automate this process.
For each scenario, repeat the generation process using the alternative prompts stored in the same scenario directory.
For every model × language × scenario:
- Locate the prompts inside `ScenarioX/prompts.txt`.
- Generate new code using each prompting-based refinement technique:
  - Negative Example Prompting → save to `Idea1`
  - Chain-of-Thought Prompting → save to `Idea2`
  - Meta Prompting → save to `Idea4`
- After generation, save the generated code into the corresponding folder for that scenario.
This step produces the samples with the prompting-based refinement techniques applied for RQ2.
Alternatively, the `automateCopilot.py` file in the corresponding `ScenarioX` folder can be used to automate this process.
All generated code was analysed using CodeQL default security packs.
For each generated sample:
- Initialize a CodeQL database for the language.
- Run the default security query suite:
| Language | CodeQL Query Path |
|---|---|
| Python | codeql-repo/python/ql/src/Security |
| Java | codeql-repo/java/ql/src/Security |
| JavaScript | codeql-repo/javascript/ql/src/Security |
| Go | codeql-repo/go/ql/src/Security |
- Export the detected CWEs.
- Store results as `.sarif` files (`results.sarif`).
- When running the security analysis for Go code, also run the `go_custom_code_scanning.py` script provided in the `Scripts` directory of this repository.
Note: For Java, run the `compile-all.bat` file in the `Scenarios\Java` directory to compile all generated Java files before scanning with CodeQL.
Note 2: The `Scripts` folder of the repository includes four `.txt` files containing scripts for running CodeQL scanning for each of the four programming languages. The scripts assume that the CodeQL tool files are in the same directory as the repository. The model the script is run for can be set in the areas of the script labelled with `<MODEL_NAME>`. The scripts can be renamed to the `.bat` file extension to be executed.
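As a rough sketch of the scanning step, the helper below assembles the `codeql database create` and `codeql database analyze` invocations for one language, using the query paths from the table above. The database directory and the source-root value (including the model name) are illustrative assumptions, not taken from the study's scripts.

```python
# Query paths mirror the table above.
QUERY_PATHS = {
    "python": "codeql-repo/python/ql/src/Security",
    "java": "codeql-repo/java/ql/src/Security",
    "javascript": "codeql-repo/javascript/ql/src/Security",
    "go": "codeql-repo/go/ql/src/Security",
}

def build_codeql_commands(language, source_root, db_dir="codeql-db"):
    """Return the (create, analyze) CodeQL CLI command lines for one language."""
    create = [
        "codeql", "database", "create", db_dir,
        f"--language={language}",
        f"--source-root={source_root}",
    ]
    analyze = [
        "codeql", "database", "analyze", db_dir,
        QUERY_PATHS[language],
        "--format=sarif-latest",
        "--output=results.sarif",
    ]
    return create, analyze

# Illustrative invocation for the Go samples of one (hypothetical) model folder.
create_cmd, analyze_cmd = build_codeql_commands("go", "Scenarios/Go/GPT-4.1/Scenarios")
print(" ".join(create_cmd))
print(" ".join(analyze_cmd))
```

The two command lists can be passed to `subprocess.run` or written out as a batch script, matching the `.txt` scripts described in Note 2.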
This step was repeated for:
- Baseline outputs
- All prompting-based refinement techniques
- Fine-tuned model outputs
Fine-tuning datasets were prepared separately for each language.
For each supported model:
- Upload the language-specific dataset to the model's cloud fine-tuning platform.
- Perform LoRA fine-tuning using default platform configurations.
- Re-run all scenario prompts using the fine-tuned model.
- Save outputs in the `Idea3` folders.
- Run CodeQL analysis again.
To fine-tune the DeepSeek R1 32B model, a system with the following specifications was used and is recommended for fine-tuning:
- CPU: Intel Xeon Gold 6242R (16 cores, 3.10 GHz)
- GPU: Tesla T4 (16 GB GDDR6 memory)
- RAM: 500 GB
- Disk size: 200 GB
- Operating system: Ubuntu 24.04.3
The following libraries were used to fine-tune DeepSeek:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
```
Run the fine-tuning process for 5 epochs.
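As a sketch only: the hyperparameters below are illustrative assumptions (only the 5-epoch count comes from this document), shown as plain dictionaries of the kind that would feed `LoraConfig`, `TrainingArguments`, and `SFTTrainer` from the imports above.

```python
# Illustrative LoRA configuration; values are typical PEFT defaults,
# NOT the exact settings used in the study.
lora_config = {
    "r": 16,                 # LoRA rank
    "lora_alpha": 32,        # scaling factor
    "lora_dropout": 0.05,
    "task_type": "CAUSAL_LM",
}

# Illustrative training arguments; only the epoch count is from this README.
training_args = {
    "num_train_epochs": 5,             # as specified above
    "per_device_train_batch_size": 1,  # small batch to fit a 16 GB T4
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "fp16": True,                      # Tesla T4 has no bf16 support
}

print(lora_config, training_args)
```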
Finally:
- Collect all CodeQL outputs.
- Map detected weaknesses to CWE IDs.
- Record results in `.csv` files.
- Calculate CWE severity for all scenarios and the percentage difference between baseline and refined samples.
- Calculate the percentage difference in the number of CWEs between baseline and refined samples.
- Note any CWEs introduced after applying model output refinement techniques.
These results were used to answer RQ1-RQ3.
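The aggregation steps above can be sketched as follows. The helper pulls the alert rule IDs out of CodeQL's standard SARIF layout (`runs[].results[].ruleId`) and computes the percentage difference in CWE counts; the inline SARIF snippet and rule IDs are illustrative, and mapping rule IDs to specific CWE numbers would additionally use the rule metadata in the SARIF file.

```python
import json  # results.sarif files are JSON documents

def cwe_rule_ids(sarif):
    """Collect the rule IDs of all alerts in a parsed SARIF document."""
    ids = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            ids.append(result.get("ruleId", "unknown"))
    return ids

def percentage_difference(baseline_count, refined_count):
    """Percentage change in detected CWEs relative to the baseline."""
    if baseline_count == 0:
        return 0.0
    return 100.0 * (refined_count - baseline_count) / baseline_count

# Minimal inline stand-in for a real results.sarif file.
sample = {"runs": [{"results": [{"ruleId": "py/sql-injection"},
                                {"ruleId": "py/clear-text-logging"}]}]}
ids = cwe_rule_ids(sample)
print(ids)
print(percentage_difference(len(ids), 1))  # baseline 2 CWEs, refined 1 → -50.0
```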
- Exact outputs (code samples) may vary slightly due to nondeterminism in LLM generation.
- API and platform updates may require minor adjustments to the full process.
- Minor script updates may be needed for newer VS Code or CodeQL versions.
```bibtex
@misc{soltanianfardjahromi2026securecode,
  title   = {On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies},
  author  = {Soltanian Fard Jahromi, Ali and Tahir, Amjed and Liang, Peng and Khomh, Foutse},
  year    = {2026},
  journal = {Submitted to ACM Transactions on Software Engineering and Methodology}
}
```