# 🏋️ NSSS Security Auditor - Colab Trainer

This notebook fine-tunes the Qwen2.5-Coder model using the NSSS Few-Shot Registry.

**Workflow:**
1.  **Prepare Data:** Downloads the CVEFixes dataset from Hugging Face and keeps only the Python samples.
2.  **Fine-tune:** Uses Unsloth (QLoRA) to train on a T4 or L4 GPU.
3.  **Save:** Exports the fine-tuned model to `outputs/qwen-security-model` on Google Drive.

**Prerequisites:**
-   **Runtime:** GPU (T4 is sufficient, A100 is faster).
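
Before running the cells, it can help to confirm which GPU the runtime actually attached. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH whenever a GPU runtime is active; the helper name `list_gpus` is hypothetical and not part of the project.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible NVIDIA GPU, or [] if none."""
    # nvidia-smi is absent on CPU-only runtimes, so treat that as "no GPU".
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected: switch the runtime type to GPU.")
```

An empty list means the runtime has no usable GPU, in which case the fine-tuning step is unlikely to work.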

In [None]:
#@title 1. Configuration & Drive Mount
import os
from google.colab import drive

#@markdown ### 📂 Project Settings
DRIVE_ROOT = "/content/drive/MyDrive/NSSS_Project" #@param {type:"string"}
GITHUB_REPO = "https://github.com/Hieureal1305/Neuro-Symbolic_Software_Security.git" #@param {type:"string"}

#@markdown ### 🔄 Sync Options
UPDATE_FROM_GITHUB = True #@param {type:"boolean"}

# Mount Drive (MyDrive only exists once the mount has succeeded)
if not os.path.isdir("/content/drive/MyDrive"):
    drive.mount("/content/drive")

print("✅ Google Drive mounted at /content/drive")

In [None]:
#@title 2. Smart Sync (Drive <-> Colab)
import shutil
import subprocess

def run_cmd(cmd, cwd=None):
    """Echo a shell command, then run it; raises CalledProcessError on failure."""
    print(f"⚡ Running: {cmd}")
    subprocess.run(cmd, shell=True, check=True, cwd=cwd)

# 1. Ensure Drive Project Folder Exists
if not os.path.exists(DRIVE_ROOT):
    print(f"📂 Creating project folder at {DRIVE_ROOT}...")
    os.makedirs(DRIVE_ROOT, exist_ok=True)
    # Initial Clone
    run_cmd(f"git clone {GITHUB_REPO} .", cwd=DRIVE_ROOT)
elif UPDATE_FROM_GITHUB:
    # 2. Optional Update (only possible when the folder is a git checkout)
    if os.path.exists(os.path.join(DRIVE_ROOT, ".git")):
        print("🔄 Updating code from GitHub...")
        run_cmd("git pull", cwd=DRIVE_ROOT)
    else:
        print(f"⚠️ {DRIVE_ROOT} is not a git checkout; skipping update.")

# 3. Setup Workspace on Colab VM
WORKSPACE = "/content/app"

if os.path.exists(WORKSPACE):
    shutil.rmtree(WORKSPACE)

print("🚀 Copying code to Colab Runtime...")
shutil.copytree(
    DRIVE_ROOT,
    WORKSPACE,
    ignore=shutil.ignore_patterns("outputs", "data", "venv", ".git", "__pycache__")
)

# 4. Link Data & Outputs to Drive
outputs_drive = os.path.join(DRIVE_ROOT, "outputs")
outputs_app = os.path.join(WORKSPACE, "outputs")

data_drive = os.path.join(DRIVE_ROOT, "data")
data_app = os.path.join(WORKSPACE, "data")

# Create the Drive-side folders if missing, then expose them inside the workspace
os.makedirs(outputs_drive, exist_ok=True)
os.makedirs(data_drive, exist_ok=True)

os.symlink(outputs_drive, outputs_app)
os.symlink(data_drive, data_app)

os.chdir(WORKSPACE)
print(f"📍 Working directory: {os.getcwd()}")
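
The linking step above relies on a freshly copied workspace: `os.symlink` raises `FileExistsError` if the link path already exists. A re-runnable variant could look like the sketch below. `link_to_drive` is a hypothetical helper (not part of the repository), and the demo uses temporary directories rather than the real Drive paths.

```python
import os
import tempfile


def link_to_drive(target_dir, link_path):
    """Create target_dir if needed and point link_path at it, replacing a stale link."""
    os.makedirs(target_dir, exist_ok=True)
    if os.path.islink(link_path):
        # Remove only the link itself; the target directory is left untouched.
        os.unlink(link_path)
    elif os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target_dir, link_path)
    return link_path


# Demo with throwaway directories standing in for DRIVE_ROOT and WORKSPACE.
demo_root = tempfile.mkdtemp()
demo_drive = os.path.join(demo_root, "drive", "outputs")
demo_link = os.path.join(demo_root, "outputs")
link_to_drive(demo_drive, demo_link)
link_to_drive(demo_drive, demo_link)  # the second call refreshes the link instead of failing
print(os.path.realpath(demo_link) == os.path.realpath(demo_drive))
```

Refusing to overwrite a real directory is deliberate: silently deleting it could discard data that was never copied to Drive.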

In [None]:
#@title 3. Install Dependencies
print("📦 Installing Unsloth...")
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --progress-bar off
!pip install --no-deps "xformers<0.0.27" "trl<0.8.6" peft accelerate bitsandbytes --progress-bar off

if os.path.exists("requirements.txt"):
    !pip install -r requirements.txt --progress-bar off

print("✅ Environment Ready!")
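
Because `!pip install` does not stop the notebook when an install fails, a quick import check can catch a broken environment early. The sketch below uses only the standard library. `missing_packages` is a hypothetical helper, and the package list simply mirrors the names installed above.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["unsloth", "xformers", "trl", "peft", "accelerate", "bitsandbytes"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

This only checks that each module can be located, not that it imports cleanly against the installed CUDA and PyTorch versions.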

In [None]:
#@title 4. Prepare Data (Real CVEs)
#@markdown This step downloads CVEFixes from HuggingFace, filters for Python,
#@markdown and generates `data/few_shot_registry.json`. 
#@markdown It may take a few minutes.

LIMIT = 2000 #@param {type:"integer"}

!python scripts/prepare_cve_data.py --limit $LIMIT
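
After this step it is worth confirming that the registry file was actually written and parses as JSON. The sketch below is one way to do that. The internal schema of `data/few_shot_registry.json` is not documented here, so the helper (hypothetical name `summarize_registry`) only reports the top-level type and entry count.

```python
import json
import os


def summarize_registry(path):
    """Load a JSON file and return (top-level type name, number of entries)."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    # Only lists and dicts have a meaningful entry count; anything else counts as 1.
    count = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, count


registry_path = "data/few_shot_registry.json"
if os.path.exists(registry_path):
    print(summarize_registry(registry_path))
else:
    print(f"{registry_path} not found: check the output of the previous cell.")
```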

In [None]:
#@title 5. Run Training
#@markdown This step fine-tunes the model and saves it directly to `DRIVE_ROOT/outputs`.

!python scripts/train_model.py --registry data/few_shot_registry.json --output outputs/qwen-security-model
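
Since `!python` does not raise on a non-zero exit code, a short check that the output directory contains files can confirm the run finished. In the sketch below, `list_artifacts` is a hypothetical helper. Which files the training script writes is not specified here, so the helper lists whatever it finds.

```python
import os


def list_artifacts(output_dir):
    """Return sorted (relative path, size in bytes) pairs for every file under output_dir."""
    artifacts = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            full = os.path.join(root, name)
            artifacts.append((os.path.relpath(full, output_dir), os.path.getsize(full)))
    return sorted(artifacts)


for rel_path, size in list_artifacts("outputs/qwen-security-model"):
    print(f"{size / 1e6:8.1f} MB  {rel_path}")
```

If nothing is printed, the directory is missing or empty, and the training log in the previous cell should be checked for errors.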