This repository contains all data and analysis results for evaluating prompting-based and fine-tuning methods for repairing AI-generated code containing security weaknesses (CWEs). The evaluation spans multiple LLMs, programming languages, and CWE-based scenarios.
Click here to access the preprint:
To replicate the study, the following software and resources are required:
- Visual Studio Code (VS Code)
- Note: The automation scripts used in this study were developed and confirmed to be working as of January 2026. Newer VS Code versions may require minor adjustments to the scripts to account for changes in the user interface.
- Access to the following LLMs and their fine-tuning platforms:
- GPT-4.1
- GPT-5
- Gemini 2.0 Flash
- o4-mini
- DeepSeek-R1-32B (local setup)
- CodeQL CLI (2.21.0) and the CodeQL default security query packs
- Python, Java, JavaScript, and Go toolchains installed
Clone the repository and install dependencies:

```shell
git clone https://github.com/AliSoltanianFJ/CodeSecurity2025
cd CodeSecurity2025/
```
The repository is structured as follows:
```
CodeSecurity2025/
├── Scenarios/
│   ├── Go
│   ├── Java
│   ├── JavaScript
│   └── Python
└── Scripts/
    ├── Results
    └── CWEsIntroducedMapping
```
Each programming language folder contains an identical scenario folder structure:
```
<Language>/
└── <Model>/
    └── Scenarios/
        ├── ScenarioX/
        │   ├── CopilotRaw
        │   ├── Idea1
        │   ├── Idea2
        │   ├── Idea3
        │   └── Idea4
```
Where:
| Folder | Meaning |
|---|---|
| CopilotRaw | Baseline model output (no refinement) |
| Idea1 | Negative Example Prompting (NEP) |
| Idea2 | Chain-of-Thought Prompting (CoT) |
| Idea3 | Fine-Tuned model outputs |
| Idea4 | Meta Prompting (MP) |
Each `<Model>` directory contains:
- `results.sarif` → The code scanning results from CodeQL, with all detected CWEs recorded.
Each `ScenarioX` directory contains:
- `prompts.txt` → The baseline and refinement technique prompts for that scenario.
- `results.csv` → A spreadsheet documenting all the CWEs detected in each code sample for that scenario (including results for the original raw samples and each refinement technique).
- A small scenario script (`automateCopilot.py`) used to execute prompts and store generated code.
The `Results` directory includes an overview of the results, and the `CWEsIntroducedMapping` directory includes network diagrams of original CWEs versus any CWEs introduced after applying refinement techniques.
The `Scripts` directory includes the custom Python-based script (`go_custom_code_scanning.py`) used to detect CWEs in Go code that CodeQL fails to detect.
- Open the repository in the terminal.
- For each language, locate the scenarios:
`Scenarios/<Language>/<Model>/Scenarios/ScenarioX/`
Each scenario corresponds to one weakness derived from the MITRE CWE Top 25.
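The scenario layout can be traversed programmatically. The sketch below, assuming the `Scenarios/<Language>/<Model>/Scenarios/ScenarioX` layout described above, enumerates every language × model × scenario combination; the model name and the throwaway directory tree are purely illustrative.

```python
import tempfile
from pathlib import Path

def iter_scenarios(root):
    """Yield (language, model, scenario) triples from the repository layout."""
    for path in sorted(Path(root).glob("Scenarios/*/*/Scenarios/*")):
        language, model, _, scenario = path.parts[-4:]
        yield language, model, scenario

# Build a miniature copy of the layout to demonstrate the traversal.
root = Path(tempfile.mkdtemp())
(root / "Scenarios/Go/GPT-4.1/Scenarios/Scenario1").mkdir(parents=True)
(root / "Scenarios/Go/GPT-4.1/Scenarios/Scenario2").mkdir(parents=True)
print(list(iter_scenarios(root)))
# → [('Go', 'GPT-4.1', 'Scenario1'), ('Go', 'GPT-4.1', 'Scenario2')]
```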
- Ensure the following dependencies are installed if using the `automateCopilot.py` scripts:
- pyperclip==1.8.2
- psutil==6.1.1
- pyautogui==0.9.54
- pywinauto==0.6.9
For each model, language, and scenario:
- Open the `prompts.txt` file in the corresponding `ScenarioX` folder.
- Submit the prompt to the target LLM via Copilot in Visual Studio Code.
- Save the generated code into the `CopilotRaw` folder for that scenario.
This produces the baseline insecure code samples used for RQ1.
Alternatively, the `automateCopilot.py` file in the corresponding `ScenarioX` folder can be used to automate this process.
For each scenario, repeat the generation process using the alternative prompts stored in the same scenario directory.
For every model × language × scenario:
- Locate the prompts inside `ScenarioX/prompts.txt`.
- Generate new code using each prompting-based refinement technique:
  - Negative Example Prompting → save to `Idea1`
  - Chain-of-Thought Prompting → save to `Idea2`
  - Meta Prompting → save to `Idea4`
- After generation, save the generated code into the corresponding folder for that scenario.
This step produces the samples with the prompting-based refinement techniques applied for RQ2.
Alternatively, the `automateCopilot.py` file in the corresponding `ScenarioX` folder can be used to automate this process.
All generated code was analysed using CodeQL default security packs.
For each generated sample:
- Initialize a CodeQL database for the language.
- Run the default security query suite:
| Language | CodeQL Query Path |
|---|---|
| Python | codeql-repo/python/ql/src/Security |
| Java | codeql-repo/java/ql/src/Security |
| JavaScript | codeql-repo/javascript/ql/src/Security |
| Go | codeql-repo/go/ql/src/Security |
- Export the detected CWEs.
- Store results as `.sarif` files (`results.sarif`).
- When running the security analysis for Go code, also run the `go_custom_code_scanning.py` script provided in the `Scripts` directory of this repository.
Note: For Java, run the `compile-all.bat` file in the `Scenarios\Java` directory to compile all generated Java files before scanning with CodeQL.
Note 2: The `Scripts` folder of the repository includes four `.txt` files containing scripts for running CodeQL scanning for each of the four programming languages. The scripts assume that the CodeQL tool files are in the same directory as the repository. The model the script is run for can be set in the areas of the script labelled with `<MODEL_NAME>`. The scripts can be renamed to the `.bat` file extension to be executed.
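As a rough sketch of the scanning step, the helper below assembles the `codeql database create` and `codeql database analyze` invocations for one language, using the query paths from the table above. The database directory and the source-root value (including the model name) are illustrative assumptions, not taken from the study's scripts.

```python
# Query paths mirror the table above.
QUERY_PATHS = {
    "python": "codeql-repo/python/ql/src/Security",
    "java": "codeql-repo/java/ql/src/Security",
    "javascript": "codeql-repo/javascript/ql/src/Security",
    "go": "codeql-repo/go/ql/src/Security",
}

def build_codeql_commands(language, source_root, db_dir="codeql-db"):
    """Return the (create, analyze) CodeQL CLI command lines for one language."""
    create = [
        "codeql", "database", "create", db_dir,
        f"--language={language}",
        f"--source-root={source_root}",
    ]
    analyze = [
        "codeql", "database", "analyze", db_dir,
        QUERY_PATHS[language],
        "--format=sarif-latest",
        "--output=results.sarif",
    ]
    return create, analyze

# Illustrative invocation for the Go samples of one (hypothetical) model folder.
create_cmd, analyze_cmd = build_codeql_commands("go", "Scenarios/Go/GPT-4.1/Scenarios")
print(" ".join(create_cmd))
print(" ".join(analyze_cmd))
```

The two command lists can be passed to `subprocess.run` or written out as a batch script, matching the `.txt` scripts described in Note 2.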
This step was repeated for:
- Baseline outputs
- All prompting-based refinement techniques
- Fine-tuned model outputs
Fine-tuning datasets were prepared separately for each language.
For each supported model:
- Upload the language-specific dataset to the model's cloud fine-tuning platform.
- Perform LoRA fine-tuning using default platform configurations.
- Re-run all scenario prompts using the fine-tuned model.
- Save outputs in the `Idea3` folders.
- Run CodeQL analysis again.
To fine-tune the DeepSeek R1 32B model, a system with the following specifications was used and is recommended for fine-tuning:
- CPU: Intel Xeon Gold 6242R (16 cores, 3.10 GHz)
- GPU: Tesla T4 (16 GB GDDR6 memory)
- RAM: 500 GB
- Disk size: 200 GB
- Operating system: Ubuntu 24.04.3
The following libraries were used to fine-tune DeepSeek:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
```
Run the fine-tuning process for 5 epochs.
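As a sketch only: the hyperparameters below are illustrative assumptions (only the 5-epoch count comes from this document), shown as plain dictionaries of the kind that would feed `LoraConfig`, `TrainingArguments`, and `SFTTrainer` from the imports above.

```python
# Illustrative LoRA configuration; values are typical PEFT defaults,
# NOT the exact settings used in the study.
lora_config = {
    "r": 16,                 # LoRA rank
    "lora_alpha": 32,        # scaling factor
    "lora_dropout": 0.05,
    "task_type": "CAUSAL_LM",
}

# Illustrative training arguments; only the epoch count is from this README.
training_args = {
    "num_train_epochs": 5,             # as specified above
    "per_device_train_batch_size": 1,  # small batch to fit a 16 GB T4
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "fp16": True,                      # Tesla T4 has no bf16 support
}

print(lora_config, training_args)
```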
Finally:
- Collect all CodeQL outputs.
- Map detected weaknesses to CWE IDs.
- Record results in `.csv` files.
- Calculate CWE severity for all scenarios and the percentage difference between baseline and refined samples.
- Calculate the percentage difference in the number of CWEs between baseline and refined samples.
- Note any CWEs introduced after applying model output refinement techniques.
These results were used to answer RQ1-RQ3.
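The aggregation steps above can be sketched as follows. The helper pulls the alert rule IDs out of CodeQL's standard SARIF layout (`runs[].results[].ruleId`) and computes the percentage difference in CWE counts; the inline SARIF snippet and rule IDs are illustrative, and mapping rule IDs to specific CWE numbers would additionally use the rule metadata in the SARIF file.

```python
import json  # results.sarif files are JSON documents

def cwe_rule_ids(sarif):
    """Collect the rule IDs of all alerts in a parsed SARIF document."""
    ids = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            ids.append(result.get("ruleId", "unknown"))
    return ids

def percentage_difference(baseline_count, refined_count):
    """Percentage change in detected CWEs relative to the baseline."""
    if baseline_count == 0:
        return 0.0
    return 100.0 * (refined_count - baseline_count) / baseline_count

# Minimal inline stand-in for a real results.sarif file.
sample = {"runs": [{"results": [{"ruleId": "py/sql-injection"},
                                {"ruleId": "py/clear-text-logging"}]}]}
ids = cwe_rule_ids(sample)
print(ids)
print(percentage_difference(len(ids), 1))  # baseline 2 CWEs, refined 1 → -50.0
```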
- Exact outputs (code samples) may vary slightly due to nondeterminism in LLM generation.
- API and platform updates may require minor adjustments to the full process.
- Minor script updates may be needed for newer VS Code or CodeQL versions.
```bibtex
@misc{soltanianfardjahromi2026securecode,
  title   = {On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies},
  author  = {Soltanian Fard Jahromi, Ali and Tahir, Amjed and Liang, Peng and Khomh, Foutse},
  year    = {2026},
  journal = {Submitted to ACM Transactions on Software Engineering and Methodology}
}
```