# EnStack: Google Colab Deployment

This notebook automates the setup and execution of the EnStack project on Google Colab.

### Prerequisites
1. Create a folder named `EnStack_Data` in your Google Drive root.
2. If you already have processed data files (`train_processed.pkl`, `val_processed.pkl`, `test_processed.pkl`), upload them into that folder. Otherwise, step 4 downloads and prepares the dataset there.

## 1. Mount Google Drive

In [None]:
from google.colab import drive
import os

print("📂 Connecting to Google Drive...")
drive.mount('/content/drive')

# Verify Drive connection
if os.path.exists('/content/drive/MyDrive'):
    print("✅ Google Drive connected successfully!")
else:
    print("❌ Failed to connect to Drive.")
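
Once Drive is mounted, it can help to confirm that the files listed under the prerequisites are actually in place. The following is a minimal sketch: the helper name `find_missing_files` is hypothetical, and the path assumes the default Colab mount point together with the `EnStack_Data` folder described above.

```python
from pathlib import Path

# File names taken from the prerequisite list at the top of this notebook.
REQUIRED_FILES = ("train_processed.pkl", "val_processed.pkl", "test_processed.pkl")


def find_missing_files(data_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in required if not (root / name).is_file()]


# Assumed location: default Colab mount point plus the EnStack_Data folder.
missing = find_missing_files("/content/drive/MyDrive/EnStack_Data")
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All required data files are present.")
```

If step 4 is used to generate the data, some files may legitimately be missing at this point.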

## 2. Clone Repository
Choose **Public** if your repo is open, or **Private** if you need a token.

In [None]:
import os
from getpass import getpass

# @markdown ### Repository Settings
REPO_TYPE = "Public" # @param ["Public", "Private"]
USERNAME = "TCTri205" # @param {type:"string"}
REPO_NAME = "EnStack-paper" # @param {type:"string"}

# Construct URL
if REPO_TYPE == "Public":
    REPO_URL = f"https://github.com/{USERNAME}/{REPO_NAME}.git"
else:
    print("🔑 Enter your Personal Access Token (PAT):")
    token = getpass()
    REPO_URL = f"https://{token}@github.com/{USERNAME}/{REPO_NAME}.git"

# Clone
%cd /content
if not os.path.exists(REPO_NAME):
    print(f"⬇️ Cloning {REPO_NAME}...")
    !git clone {REPO_URL}
else:
    print("🔄 Repository exists. Pulling latest changes...")
    !cd {REPO_NAME} && git pull

# Change directory to project root
%cd /content/{REPO_NAME}
print(f"✅ Current working directory: {os.getcwd()}")
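
The URL construction in the cell above can be pulled into small functions, which makes it easier to avoid printing the token by accident. This is a sketch only: `build_repo_url` and `mask_token` are hypothetical helper names, and the token shown is a placeholder.

```python
def build_repo_url(username, repo_name, token=None):
    """Build an HTTPS clone URL, embedding the token only when one is given."""
    auth = f"{token}@" if token else ""
    return f"https://{auth}github.com/{username}/{repo_name}.git"


def mask_token(url, token):
    """Replace the token with asterisks so the URL is safe to print."""
    return url.replace(token, "***") if token else url


# Placeholder token for illustration; in the notebook it comes from getpass().
url = build_repo_url("TCTri205", "EnStack-paper", token="example-token")
print(mask_token(url, "example-token"))
```

Note that git stores the remote URL, including any embedded token, in `.git/config`, so a token in the URL persists in the clone for the lifetime of the Colab session.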

## 3. Install Environment

In [None]:
print("📦 Installing dependencies...")
!pip install -r requirements.txt -q

# Install additional useful packages for Colab
!pip install pyyaml tqdm scikit-learn transformers torch -q

print("✅ Environment setup complete.")
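
Because `pip install -q` suppresses most output, a quick version report can confirm what ended up installed. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical. It expects distribution names (for example `pyyaml`, not the import name `yaml`).

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip command in the cell above.
report = installed_versions(["pyyaml", "tqdm", "scikit-learn", "transformers", "torch"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```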

## 4. Download & Prepare Real Data (Draper VDISC)
This step downloads the dataset from Hugging Face and processes it into the required format.

In [None]:
# Configurable sample size (set to 0 or None for full dataset)
SAMPLE_SIZE = 20000 # @param {type:"integer"}

print(f"🔄 Downloading and processing data (Sample size: {SAMPLE_SIZE})...")
# Check truthiness first so that None does not raise a TypeError in the comparison
if SAMPLE_SIZE and SAMPLE_SIZE > 0:
    !python scripts/prepare_data.py --output_dir /content/drive/MyDrive/EnStack_Data --sample {SAMPLE_SIZE}
else:
    !python scripts/prepare_data.py --output_dir /content/drive/MyDrive/EnStack_Data

print("✅ Data preparation complete.")
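
The two branches above differ only in the `--sample` flag, so the command can also be assembled as an argument list. This is a sketch under the assumption that `scripts/prepare_data.py` accepts exactly the `--output_dir` and `--sample` flags used in the cell; `build_prepare_cmd` is a hypothetical helper.

```python
import sys


def build_prepare_cmd(output_dir, sample_size=None):
    """Build the prepare_data.py argument list; omit --sample for the full dataset."""
    cmd = [sys.executable, "scripts/prepare_data.py", "--output_dir", output_dir]
    # 0, None, and negative values all mean "use the full dataset".
    if sample_size and sample_size > 0:
        cmd += ["--sample", str(sample_size)]
    return cmd


cmd = build_prepare_cmd("/content/drive/MyDrive/EnStack_Data", sample_size=20000)
print(" ".join(cmd))
```

In the notebook, the resulting list could be passed to `subprocess.run(cmd, check=True)`, which raises an exception if the script exits with a non-zero status.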

## 5. Verify Data Configuration

In [None]:
import yaml
import os

CONFIG_PATH = "configs/config.yaml"

# Load config
if os.path.exists(CONFIG_PATH):
    with open(CONFIG_PATH, 'r') as f:
        config = yaml.safe_load(f)

    data_root = config['data']['root_dir']
    print(f"🔍 Configured data path: {data_root}")

    if os.path.exists(data_root):
        print("✅ Data directory found on Drive!")
        print("   Files:", os.listdir(data_root))
    else:
        print(f"❌ Directory '{data_root}' not found.")
        print("⚠️ Please ensure you created 'EnStack_Data' in MyDrive and uploaded your .pkl files.")
else:
    print("❌ config.yaml not found. Did the repo clone correctly?")
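
The lookup `config['data']['root_dir']` raises an error if either key is missing, and `yaml.safe_load` returns `None` for an empty file. A tolerant lookup might look like the sketch below; `get_nested` is a hypothetical helper, and the dictionary is only a stand-in for the parsed config (its `root_dir` value is illustrative, not taken from the repository).

```python
def get_nested(config, *keys):
    """Walk nested dicts; return None if any key along the path is missing."""
    node = config
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Stand-in for the result of yaml.safe_load(); the path value is illustrative.
config = {"data": {"root_dir": "/content/drive/MyDrive/EnStack_Data"}}

data_root = get_nested(config, "data", "root_dir")
if data_root is None:
    print("Config has no data.root_dir entry.")
else:
    print(f"Configured data path: {data_root}")
```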

## 6. Run Training Pipeline

In [None]:
# Run the main training script
!python scripts/train.py --config configs/config.yaml
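
Colab sessions can disconnect during long runs, so it may be worth keeping a copy of the training output on disk. The following sketch wraps the run in `subprocess.run` and writes combined stdout and stderr to a log file. The helper name `run_and_log` and the log file name are hypothetical; the script and config paths are the ones used in the cell above.

```python
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd, write combined stdout/stderr to log_path, and return the exit code."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Notebook usage (log file name is hypothetical):
# exit_code = run_and_log(
#     [sys.executable, "scripts/train.py", "--config", "configs/config.yaml"],
#     "train.log",
# )
# if exit_code != 0:
#     print(f"Training exited with code {exit_code}; see train.log")
```

The trade-off is that output no longer streams into the notebook cell while training runs; the `!python` form above remains the simpler choice for interactive use.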