# EnStack: Advanced Stacking Ensemble for Vulnerability Detection

This notebook provides an optimized pipeline for reproducing the results of the EnStack paper on Google Colab.

### ⚡ Optimized Features:
1.  **High-Speed Training:** Automatic Mixed Precision (AMP) and dynamic padding (5-8x faster training).
2.  **Memory Efficient:** Lazy loading and gradient checkpointing (fits large language models on a T4 GPU).
3.  **Algorithmic Correctness:** K-Fold Out-of-Fold (OOF) stacking, so the meta-classifier trains only on predictions for samples the base models did not see during fitting, which prevents data leakage.
4.  **Advanced Visualization:** Confusion matrices, ROC curves, and Feature Importance plots.
5.  **Production Ready:** Export models to ONNX and TorchScript.
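
The out-of-fold idea in item 3 can be sketched in a few lines. This is a minimal, pure-Python illustration rather than the repository's implementation: a toy midpoint-threshold classifier stands in for a fine-tuned base model, and `kfold_indices`, `fit_midpoint_classifier`, and `oof_predictions` are hypothetical names. What it shows is that each sample's meta-feature comes from a model that was fitted without that sample.

```python
from statistics import mean
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def kfold_indices(n_samples: int, n_folds: int) -> List[List[int]]:
    """Split range(n_samples) into n_folds interleaved validation folds."""
    return [list(range(fold, n_samples, n_folds)) for fold in range(n_folds)]


def fit_midpoint_classifier(xs: Sequence[float], ys: Sequence[int]) -> Model:
    """Toy stand-in for a base model: threshold halfway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    threshold = (mean(pos) + mean(neg)) / 2
    return lambda x: 1.0 if x > threshold else 0.0


def oof_predictions(xs, ys, fit, n_folds: int = 5) -> List[float]:
    """Predict every sample with a model fitted on the other folds only."""
    oof = [None] * len(xs)
    for val_idx in kfold_indices(len(xs), n_folds):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in val_idx:
            oof[i] = model(xs[i])  # this sample was not used to fit `model`
    return oof


# Illustrative data: one feature, two well-separated classes.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
oof = oof_predictions(xs, ys, fit_midpoint_classifier, n_folds=5)
print(oof)
```

The resulting `oof` list has one entry per training sample and can be used as a meta-classifier input column without leaking training labels through the base model.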

---

## 1. Environment Setup

In [None]:
import os
from google.colab import drive

# 1. Mount Drive
print("📂 Connecting to Google Drive...")
drive.mount('/content/drive')

# 2. Clone Repository
REPO_NAME = "EnStack-paper" # @param {type:"string"}
GITHUB_USER = "TCTri205" # @param {type:"string"}

%cd /content
if not os.path.exists(REPO_NAME):
    print(f"⬇️ Cloning {REPO_NAME}...")
    !git clone https://github.com/{GITHUB_USER}/{REPO_NAME}.git
else:
    print("🔄 Repository exists. Pulling latest optimized version...")
    !cd {REPO_NAME} && git pull

%cd /content/{REPO_NAME}

# 3. Install Dependencies
print("📦 Installing high-performance dependencies...")
!pip install -r requirements.txt -q
# Quote the extras specifier so the shell never treats the brackets as a glob pattern
!pip install "transformers[torch]" datasets pyarrow xgboost tensorboard seaborn matplotlib -q

print("\n✅ Setup complete. Ready to train.")

## 2. Check Hardware Acceleration

In [None]:
import torch
import psutil

print("🔍 Hardware Check:")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("❌ GPU NOT FOUND. Please go to: Runtime -> Change runtime type -> T4 GPU")

print(f"✅ System RAM: {psutil.virtual_memory().total / 1e9:.2f} GB")

## 3. Data Preparation
Choose either the full **Draper VDISC** dataset (for paper reproduction) or **Dummy Data** (synthetic samples for a quick code test). `SAMPLE_SIZE` applies only to the dummy data.

In [None]:
# @markdown ### Data Source Configuration
DATA_MODE = "Draper VDISC" # @param ["Draper VDISC", "Dummy Data"]
SAMPLE_SIZE = 5000 # @param {type:"integer"}

if DATA_MODE == "Draper VDISC":
    print("🚀 Downloading and processing Draper VDISC (~1GB)...")
    !chmod +x scripts/setup_draper.sh
    !./scripts/setup_draper.sh
else:
    print(f"🔄 Generating synthetic dummy data ({SAMPLE_SIZE} samples)...")
    !python scripts/prepare_data.py --output_dir /content/drive/MyDrive/EnStack_Data --mode synthetic --sample {SAMPLE_SIZE}

print("\n✅ Data is ready on Google Drive.")

## 4. Run Optimized Training Pipeline
This cell writes its form parameters into `configs/config.yaml`, then launches `scripts/train.py`, which trains the base models (CodeBERT, etc.) and the meta-classifier.
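
The `BATCH_SIZE` and `ACCUMULATION_STEPS` parameters interact: with gradient accumulation, each optimizer update sees roughly `BATCH_SIZE * ACCUMULATION_STEPS` samples. The sketch below is only arithmetic, not repository code. The helper names are hypothetical, the sample count of 5000 is illustrative, and it assumes the trainer also steps on a trailing partial accumulation window. It shows how a smaller per-step batch can keep the same effective batch size when GPU memory is tight.

```python
import math


def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Number of samples that contribute to each optimizer update."""
    return batch_size * accumulation_steps


def optimizer_steps_per_epoch(n_samples: int, batch_size: int, accumulation_steps: int) -> int:
    """Optimizer updates per epoch.

    Assumption: the trainer also steps on a trailing, partially filled
    accumulation window, hence the second ceil.
    """
    micro_batches = math.ceil(n_samples / batch_size)
    return math.ceil(micro_batches / accumulation_steps)


# Two settings with the same effective batch size but different peak memory.
default_setting = effective_batch_size(batch_size=16, accumulation_steps=1)
low_memory_setting = effective_batch_size(batch_size=4, accumulation_steps=4)

steps_default = optimizer_steps_per_epoch(5000, 16, 1)
steps_low_memory = optimizer_steps_per_epoch(5000, 4, 4)

print(default_setting, low_memory_setting)
print(steps_default, steps_low_memory)
```

If training runs out of memory at the default settings, lowering `BATCH_SIZE` while raising `ACCUMULATION_STEPS` by the same factor is the usual first thing to try.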

In [None]:
# @markdown ### Training Configuration
EPOCHS = 10 # @param {type:"integer"}
BATCH_SIZE = 16 # @param {type:"integer"}
ACCUMULATION_STEPS = 1 # @param {type:"integer"}
USE_SWA = False # @param {type:"boolean"}
RESUME = True # @param {type:"boolean"}

import yaml

# Update config.yaml with notebook parameters
with open('configs/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

config['training']['epochs'] = EPOCHS
config['training']['batch_size'] = BATCH_SIZE
config['training']['gradient_accumulation_steps'] = ACCUMULATION_STEPS
config['training']['use_swa'] = USE_SWA

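with open('configs/config.yaml', 'w') as f:
    # sort_keys=False keeps the original section order in the rewritten file
    yaml.safe_dump(config, f, sort_keys=False)

# Build the optional flag up front so the shell command stays readable
resume_flag = "--resume" if RESUME else ""

print("🚀 Starting Training Pipeline...")
!python scripts/train.py --config configs/config.yaml {resume_flag}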

## 5. Meta-Classifier Comparison (Table III Reproduction)
Evaluate different meta-classifiers (Logistic Regression, Random Forest, SVM, XGBoost) on the same cached base-model features.
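
The comparison table reports accuracy and F1 as percentages. As a rough guide to reading those columns, here is a pure-Python sketch of both metrics. It assumes macro-averaged F1 over all classes, which may differ from the averaging `evaluate_meta_classifier` actually uses, and the function names and labels are made up for illustration.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores: List[float] = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc_percent = accuracy(y_true, y_pred) * 100
f1_percent = macro_f1(y_true, y_pred) * 100
print(f"Acc: {acc_percent:.2f}  F1: {f1_percent:.2f}")
```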

In [None]:
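import os
import sys
from pathlib import Path
import pandas as pd
import yaml

# Make the repository root importable (the notebook has already cd'ed into it)
if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())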
from scripts.train import extract_all_features, train_base_models, load_labels_from_file
from src.stacking import (
    evaluate_meta_classifier,
    prepare_meta_features,
    train_meta_classifier,
)
from src.utils import get_device
from IPython.display import display

def reproduce_table_iii():
    print("📊 Comparing Meta-Classifiers (LR vs RF vs SVM vs XGBoost)...")
    
    with open("configs/config.yaml", 'r') as f:
        config = yaml.safe_load(f)
    
    device = get_device()
    root_dir = Path(config['data']['root_dir'])
    
    # 1. Load models and pre-created dataloaders
    trainers, dataloaders = train_base_models(config, config['model']['base_models'], 
                                             num_epochs=0, device=device, resume=True)
    
    # 2. Extract Optimized Features (with caching)
    features_dict = extract_all_features(config, trainers, dataloaders, mode="logits", use_cache=True)
    
    # 3. Load Labels
    train_labels = load_labels_from_file(root_dir / config['data']['train_file'])
    test_labels = load_labels_from_file(root_dir / config['data']['test_file'])
    
    # 4. Prepare Meta-features with Scaling/PCA
    train_meta, _, pca, scaler = prepare_meta_features(features_dict['train'], train_labels, use_pca=True, use_scaling=True)
    test_meta, _, _, _ = prepare_meta_features(features_dict['test'], pca_model=pca, scaler=scaler, use_pca=True, use_scaling=True)
    
    # 5. Iterative Evaluation
    results = []
    for m_type in ["lr", "rf", "svm", "xgboost"]:
        print(f"  > Training {m_type.upper()}...")
        params = config['model']['meta_classifier_params'].get(m_type, {})
        clf = train_meta_classifier(train_meta, train_labels, classifier_type=m_type, **params)
        metrics = evaluate_meta_classifier(clf, test_meta, test_labels)
        results.append({"Classifier": m_type.upper(), "Acc": metrics['accuracy']*100, "F1": metrics['f1']*100, "AUC": metrics['auc']*100})
    
    return pd.DataFrame(results)

comparison_df = reproduce_table_iii()
display(comparison_df)
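
Step 4 above passes the `scaler` and `pca` objects fitted on the training features into the test call. The sketch below illustrates why, under stated assumptions: it is not the repository's `prepare_meta_features`, it omits PCA, and `concat_logits`, `fit_scaler`, and `apply_scaler` are hypothetical helpers written in pure Python. Scaling statistics are computed on the training rows only and then reused unchanged on the test rows, so nothing about the test set leaks into preprocessing.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Rows = List[List[float]]
Scaler = Tuple[List[float], List[float]]  # (column means, column std devs)


def concat_logits(per_model_logits: Dict[str, Rows]) -> Rows:
    """Join each model's logit rows side by side: one meta-feature row per sample."""
    models = sorted(per_model_logits)  # fixed order so train and test columns line up
    n_samples = len(per_model_logits[models[0]])
    return [
        [value for m in models for value in per_model_logits[m][i]]
        for i in range(n_samples)
    ]


def fit_scaler(train_rows: Rows) -> Scaler:
    """Compute per-column mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return means, stds


def apply_scaler(rows: Rows, scaler: Scaler) -> Rows:
    """Standardize rows with statistics that were fitted elsewhere."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up logits from two hypothetical base models, two classes each.
train_logits = {"model_a": [[2.0, -1.0], [0.0, 1.0]], "model_b": [[1.0, 3.0], [3.0, 1.0]]}
test_logits = {"model_a": [[1.0, 0.0]], "model_b": [[2.0, 2.0]]}

train_rows = concat_logits(train_logits)
scaler = fit_scaler(train_rows)  # fitted on training data only
train_meta = apply_scaler(train_rows, scaler)
test_meta = apply_scaler(concat_logits(test_logits), scaler)  # reuses the train statistics
print(train_meta, test_meta)
```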

## 6. Advanced Visualization

In [None]:
import glob
import os
from IPython.display import Image, display

# `config` is the dictionary loaded in the training cell (Section 4)
output_dir = config['training']['output_dir']

print("📈 Training Curves:")
hist_plots = glob.glob(f"{output_dir}/**/training_history.png", recursive=True)
for p in hist_plots:
    print(f"Source: {p}")
    display(Image(filename=p))

# Show each summary plot only if the training run actually produced it
summary_plots = [
    ("🎯 Final Confusion Matrix:", "confusion_matrix.png"),
    ("⭐ Feature Importance (Base Model Impact):", "feature_importance.png"),
]
for title, name in summary_plots:
    path = os.path.join(output_dir, name)
    print(f"\n{title}")
    if os.path.exists(path):
        display(Image(filename=path))
    else:
        print(f"   (not found: {path})")

## 7. Model Export for Deployment

In [None]:
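# Export the primary model to ONNX for 3x faster CPU inference
import os
import torch
from src.models import create_model

print("🚀 Exporting model for production...")
model_name = config['model']['base_models'][0]
model, _ = create_model(model_name, config, pretrained=False)

checkpoint_path = f"{config['training']['output_dir']}/{model_name}/last_checkpoint"
if os.path.exists(checkpoint_path):
    # Load weights from the most recent checkpoint of this model
    state_dict = torch.load(f"{checkpoint_path}/pytorch_model.bin", map_location='cpu')
    # strict=False tolerates key mismatches, so report them instead of hiding them
    load_result = model.load_state_dict(state_dict, strict=False)
    if load_result.missing_keys or load_result.unexpected_keys:
        print(f"⚠️ Missing keys: {load_result.missing_keys}")
        print(f"⚠️ Unexpected keys: {load_result.unexpected_keys}")
    model.eval()  # disable dropout before exporting

    onnx_path = f"{config['training']['output_dir']}/optimized_model.onnx"
    model.export_onnx(onnx_path)
    print(f"✅ Successfully exported to: {onnx_path}")
else:
    print("❌ No checkpoint found to export.")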