# EnStack: Google Colab Deployment

This notebook automates the setup and execution of the EnStack project on Google Colab.

### Prerequisites
1. Create a folder named `EnStack_Data` in your Google Drive root.
2. Upload your data files (`train_processed.pkl`, `val_processed.pkl`, `test_processed.pkl`) into that folder.

## 1. Mount Google Drive

In [None]:
from google.colab import drive
import os

print("📂 Connecting to Google Drive...")
drive.mount('/content/drive')

# Verify Drive connection
if os.path.exists('/content/drive/MyDrive'):
    print("✅ Google Drive connected successfully!")
else:
    print("❌ Failed to connect to Drive.")

## 2. Clone Repository
Set `REPO_TYPE` to **Public** if the repository is open, or to **Private** if cloning requires a personal access token.

In [None]:
import os
from getpass import getpass

# @markdown ### Repository Settings
REPO_TYPE = "Public" # @param ["Public", "Private"]
USERNAME = "TCTri205" # @param {type:"string"}
REPO_NAME = "EnStack-paper" # @param {type:"string"}

# Construct URL
if REPO_TYPE == "Public":
    REPO_URL = f"https://github.com/{USERNAME}/{REPO_NAME}.git"
else:
    print("🔑 Enter your Personal Access Token (PAT):")
    token = getpass()
    REPO_URL = f"https://{token}@github.com/{USERNAME}/{REPO_NAME}.git"

# Clone
%cd /content
if not os.path.exists(REPO_NAME):
    print(f"⬇️ Cloning {REPO_NAME}...")
    !git clone {REPO_URL}
else:
    print("🔄 Repository exists. Pulling latest changes...")
    !cd {REPO_NAME} && git pull

# Change directory to project root
%cd /content/{REPO_NAME}
print(f"✅ Current working directory: {os.getcwd()}")
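
The URL construction above can be factored into a small helper, sketched below. The function names `build_repo_url` and `redact` are hypothetical and not part of the EnStack repository; the sketch assumes the token should be percent-encoded before it is embedded and hidden before any URL is printed.

```python
from typing import Optional
from urllib.parse import quote


def build_repo_url(username: str, repo_name: str, token: Optional[str] = None) -> str:
    """Return an HTTPS clone URL; embed the token only when one is given."""
    base = f"github.com/{username}/{repo_name}.git"
    if token:
        # Percent-encode the token so special characters cannot break the URL.
        return f"https://{quote(token, safe='')}@{base}"
    return f"https://{base}"


def redact(url: str, token: Optional[str]) -> str:
    """Replace the embedded token with '***' before the URL is printed or logged."""
    return url.replace(quote(token, safe=""), "***") if token else url
```

Keep in mind that a token embedded in the clone URL is also written to the repository's `.git/config`, so it is safest to use a short-lived token with minimal scopes.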

## 3. Check GPU Availability

In [None]:
import torch

print("🔍 Checking GPU availability...")
if torch.cuda.is_available():
    print(f"✅ GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("❌ No GPU detected. Training will be VERY slow on CPU.")
    print("\n⚠️  IMPORTANT: Enable GPU for faster training:")
    print("   1. Go to Runtime → Change runtime type")
    print("   2. Select Hardware accelerator: T4 GPU")
    print("   3. Click Save and restart the notebook\n")

    # Pause briefly so the user can stop the run and enable a GPU first
    import time
    print("⏳ Waiting 10 seconds... Press the 'Stop' button if you want to enable GPU first.")
    time.sleep(10)
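
If later cells need to know which device was found, the check can be reduced to a single value, as in this minimal sketch. The helper `pick_device` is hypothetical; whether `scripts/train.py` selects its device the same way is not stated in this notebook.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```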

## 4. Install Environment

In [None]:
print("📦 Installing dependencies...")
!pip install -r requirements.txt -q

# Install additional useful packages for Colab
!pip install pyyaml tqdm scikit-learn transformers torch -q

print("✅ Environment setup complete.")

## 5. Download & Prepare Real Data
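This step runs `scripts/prepare_data.py`. Depending on `DATA_MODE`, it downloads a public vulnerability dataset or generates synthetic data, then processes it into the required format and writes it to the `EnStack_Data` folder on Drive.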

In [None]:
# @markdown ### Data Configuration
# @markdown Choose data source:
# @markdown - **auto**: Try public dataset, fallback to synthetic
# @markdown - **public**: Use code_x_glue_cc_defect_detection
# @markdown - **synthetic**: Generate test data
# @markdown - **manual**: Show instructions for uploading Draper VDISC

DATA_MODE = "auto" # @param ["auto", "public", "synthetic", "manual"]
SAMPLE_SIZE = 5000 # @param {type:"integer"}

print(f"🔄 Preparing data (Mode: {DATA_MODE}, Sample size: {SAMPLE_SIZE})...")
!python scripts/prepare_data.py --output_dir /content/drive/MyDrive/EnStack_Data --mode {DATA_MODE} --sample {SAMPLE_SIZE}

print("\n✅ Data preparation complete.")
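
After preparation, it can help to look at what each `.pkl` file actually contains. The sketch below assumes the files load with the standard `pickle` module; the notebook does not state the object type stored in them, and if they hold pandas DataFrames, pandas must be installed for unpickling to succeed. The helper `summarize_pickle` is hypothetical.

```python
import pickle


def summarize_pickle(path):
    """Load one pickle file and return (type name, length or None)."""
    with open(path, "rb") as f:
        # Only unpickle files you created yourself or otherwise trust.
        obj = pickle.load(f)
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For example, `summarize_pickle("/content/drive/MyDrive/EnStack_Data/train_processed.pkl")` returns the stored object's type name and its length, which is a quick way to confirm that `SAMPLE_SIZE` was applied.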

## 6. Verify Data Configuration

In [None]:
import yaml
import os

CONFIG_PATH = "configs/config.yaml"

# Load config
if os.path.exists(CONFIG_PATH):
    with open(CONFIG_PATH, 'r') as f:
        config = yaml.safe_load(f)
    
    data_root = config['data']['root_dir']
    print(f"🔍 Configured data path: {data_root}")

    if os.path.exists(data_root):
        print("✅ Data directory found on Drive!")
        print("   Files:", os.listdir(data_root))
    else:
        print(f"❌ Directory '{data_root}' not found.")
        print("⚠️ Please ensure you created 'EnStack_Data' in MyDrive and uploaded your .pkl files.")
else:
    print("❌ config.yaml not found. Did the repo clone correctly?")
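
Listing the directory shows what is present, but not what is missing. A minimal sketch of a stricter check follows, assuming the three file names from the prerequisites are exactly what training expects; `missing_files` and `REQUIRED_FILES` are hypothetical names.

```python
from pathlib import Path

# File names taken from the prerequisites at the top of this notebook.
REQUIRED_FILES = ("train_processed.pkl", "val_processed.pkl", "test_processed.pkl")


def missing_files(data_root, required=REQUIRED_FILES):
    """Return the required file names that are absent from data_root."""
    root = Path(data_root)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_files(data_root)` after the config is loaded returns an empty list when all three splits are in place, so training can be skipped early when something is absent.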

## 7. Configure Training Parameters

In [None]:
# @markdown ### Training Configuration
# @markdown Reduce epochs for faster testing (default is 10 in config.yaml)

EPOCHS = 2 # @param {type:"integer"}
BATCH_SIZE = 16 # @param {type:"integer"}

print(f"📋 Training will use: {EPOCHS} epochs, batch size {BATCH_SIZE}")
print("⏱️  Estimated time per epoch: ~5-10 minutes on GPU, ~30-60 minutes on CPU")

## 8. Run Training Pipeline

In [None]:
# Run the main training script with custom parameters
!python scripts/train.py --config configs/config.yaml --epochs {EPOCHS} --batch-size {BATCH_SIZE}
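
A `!` shell command does not stop the cell when the script exits with an error. As an alternative, the same command can be launched through `subprocess`, sketched below with the flags used above. The helpers `build_train_command` and `run_training` are hypothetical, and the validation rule (positive integers only) is an assumption rather than something `scripts/train.py` is known to enforce.

```python
import subprocess
import sys


def build_train_command(epochs, batch_size, config_path="configs/config.yaml"):
    """Build the argument list for scripts/train.py, rejecting non-positive values."""
    for name, value in (("epochs", epochs), ("batch_size", batch_size)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return [
        sys.executable, "scripts/train.py",
        "--config", config_path,
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
    ]


def run_training(epochs, batch_size):
    """Run training and raise CalledProcessError if the script exits non-zero."""
    subprocess.run(build_train_command(epochs, batch_size), check=True)
```

With this in place, `run_training(EPOCHS, BATCH_SIZE)` from the project root behaves like the cell above, except that a failed run raises an exception instead of passing silently.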