# GR00T Probe Training Notebook

This notebook is a thin Colab wrapper around `probe/train_probe.py` for training the GR00T probe model.

## Features:
- **Linear Regression**: Simple probe with no hidden layers
- **Feature Types**: Choose between `mean_pooled` or `last_vector` features
- **Input Shape**: 2048-dimensional feature vectors (shape `[2048]`)
- **Easy Configuration**: Modify parameters in the Configuration cell
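
As a rough illustration of the feature types and the probe shape listed above, here is a minimal NumPy sketch. It assumes, from the names alone, that `mean_pooled` averages a sequence of per-token vectors and `last_vector` keeps the final one. The function `pool_features`, the random input, and `TARGET_DIM` are hypothetical and are not taken from the repository's extraction or training code.

```python
import numpy as np

HIDDEN_DIM = 2048  # feature size stated in the list above
TARGET_DIM = 4     # hypothetical output size, chosen only for illustration


def pool_features(hidden_states: np.ndarray, feature_type: str) -> np.ndarray:
    """Reduce a [seq_len, hidden_dim] array to a single [hidden_dim] vector."""
    if feature_type == "mean_pooled":
        return hidden_states.mean(axis=0)  # average over the sequence axis
    if feature_type == "last_vector":
        return hidden_states[-1]           # keep only the final position
    raise ValueError(f"Unknown feature type: {feature_type!r}")


rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(7, HIDDEN_DIM))  # 7 positions, made-up data

mean_vec = pool_features(hidden_states, "mean_pooled")
last_vec = pool_features(hidden_states, "last_vector")

# A probe with no hidden layers is a single affine map: y = x @ W + b
weights = np.zeros((HIDDEN_DIM, TARGET_DIM))
bias = np.zeros(TARGET_DIM)
prediction = mean_vec @ weights + bias

print(mean_vec.shape, last_vec.shape, prediction.shape)
```

Either pooling choice yields a vector of the same size, so the same probe shape works for both feature types.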

## Requirements:
- Processed training data: `probe_training_data_150k_processed.parquet`
  - Generated by running `getting_started/extract_probe_training_data.ipynb`

In [3]:
# ===== GR00T CLEAN SETUP IN COLAB =====

# Step 1: Clone repo
!git clone https://github.com/IdoXpoz/Isaac-GR00T-fork.git
%cd Isaac-GR00T-fork

Cloning into 'Isaac-GR00T-fork'...
remote: Enumerating objects: 756, done.
remote: Counting objects: 100% (398/398), done.
remote: Compressing objects: 100% (222/222), done.
remote: Total 756 (delta 281), reused 193 (delta 176), pack-reused 358 (from 3)
Receiving objects: 100% (756/756), 48.60 MiB | 14.99 MiB/s, done.
Resolving deltas: 100% (393/393), done.
/content/Isaac-GR00T-fork


In [4]:
!git fetch
!git checkout main


Already on 'main'
Your branch is up to date with 'origin/main'.


In [5]:
!git pull origin main

From https://github.com/IdoXpoz/Isaac-GR00T-fork
 * branch            main       -> FETCH_HEAD
Already up to date.


In [6]:
# Mount Google Drive for persistent storage
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 🔧 Configuration

Modify these parameters to customize your training:

In [9]:
# Training Configuration
FEATURE_TYPE = "mean_pooled"  # Options: "mean_pooled" or "last_vector"
DATA_PATH = "/content/drive/MyDrive/probe_training_data/probe_training_data_150k_processed.parquet"  # Path to processed data
BATCH_SIZE = 32
NUM_EPOCHS = 100

print("📊 Configuration:")
print(f"   • Feature Type: {FEATURE_TYPE}")
print(f"   • Data Path: {DATA_PATH}")
print(f"   • Batch Size: {BATCH_SIZE}")
print(f"   • Epochs: {NUM_EPOCHS}")

📊 Configuration:
   • Feature Type: mean_pooled
   • Data Path: /content/drive/MyDrive/probe_training_data/probe_training_data_150k_processed.parquet
   • Batch Size: 32
   • Epochs: 100
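
Before starting a long run, it can help to sanity-check these values. The sketch below uses only the standard library; `validate_config` is a hypothetical helper, not part of the repository, and the allowed feature types are taken from the comment in the configuration cell.

```python
import os

VALID_FEATURE_TYPES = ("mean_pooled", "last_vector")


def validate_config(feature_type, data_path, batch_size, num_epochs):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if feature_type not in VALID_FEATURE_TYPES:
        problems.append(
            f"FEATURE_TYPE must be one of {VALID_FEATURE_TYPES}, got {feature_type!r}"
        )
    if not os.path.isfile(data_path):
        problems.append(f"data file not found: {data_path}")
    for name, value in (("BATCH_SIZE", batch_size), ("NUM_EPOCHS", num_epochs)):
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{name} must be a positive integer, got {value!r}")
    return problems


# Example with a path that does not exist; in the notebook, pass
# FEATURE_TYPE, DATA_PATH, BATCH_SIZE and NUM_EPOCHS instead.
for problem in validate_config("mean_pooled", "/nonexistent/data.parquet", 32, 100):
    print(f"   ❌ {problem}")
```

A missing data file usually means Google Drive is not mounted or the extraction notebook has not been run yet.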


## 🚀 Start Training

Run the cell below to start training the probe model:

In [None]:
# Import and run training
import sys
import os

# Add the repo root (the current directory after %cd) to the import path
# so that the `probe` package can be imported
sys.path.append(os.getcwd())

# Import training function
from probe.train_probe import main as train_main

print("🏁 Starting probe training...")
print("=" * 60)

# Run training with specified parameters
train_main(
    feature_type=FEATURE_TYPE,
    data_path=DATA_PATH,
    batch_size=BATCH_SIZE,
    num_epochs=NUM_EPOCHS
)

print("=" * 60)
print("🎉 Training completed!")

🏁 Starting probe training...
Using device: cuda
Feature type: mean_pooled
Loading data from /content/drive/MyDrive/probe_training_data/probe_training_data_150k_processed.parquet...
Using feature type: mean_pooled


## 📊 Check Training Results

After training, check that the output files were written to Google Drive:

In [None]:
import os

# Check output files in mounted drive structure
probe_output_dir = f"/content/drive/MyDrive/probes/{FEATURE_TYPE}"
output_files = [
    os.path.join(probe_output_dir, "best_probe_model.pth"),
    os.path.join(probe_output_dir, "training_history.pkl")
]

print(f"📁 Output Files (in {probe_output_dir}):")
for file_path in output_files:
    if os.path.exists(file_path):
        size_mb = os.path.getsize(file_path) / (1024 * 1024)
        print(f"   ✅ {os.path.basename(file_path)} ({size_mb:.2f} MB)")
    else:
        print(f"   ❌ {os.path.basename(file_path)} (not found)")

print(f"\n🎯 Feature type used: {FEATURE_TYPE}")
print(f"\n📁 All outputs saved to: {probe_output_dir}")
print("\n📝 Next steps:")
print("   1. Run evaluate_probe.ipynb to evaluate the trained model")
print("   2. Make sure to use the same feature type for evaluation")
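
To look inside the saved history file, a small loader such as the one below may be useful. This is a sketch under assumptions: the internal layout of `training_history.pkl` is not documented here, so the hypothetical helper `summarize_history` only reports the top-level structure rather than relying on specific keys. The path follows the output directory used in the cell above, with `mean_pooled` as an example.

```python
import os
import pickle


def summarize_history(path):
    """Load a pickled history object and describe its top-level structure."""
    with open(path, "rb") as f:
        history = pickle.load(f)  # only unpickle files you created yourself
    if isinstance(history, dict):
        # For sized values report the number of entries, otherwise the value itself
        return {
            key: (len(value) if hasattr(value, "__len__") else value)
            for key, value in history.items()
        }
    return {"type": type(history).__name__}


history_path = "/content/drive/MyDrive/probes/mean_pooled/training_history.pkl"
if os.path.exists(history_path):
    print(summarize_history(history_path))
else:
    print(f"History file not found: {history_path}")
```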

## 🔄 Quick Feature Type Comparison

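Want to compare both feature types? Set `COMPARE_BOTH = True` in the cell below and run it to train both. The training cell above must have been run first, since this cell reuses `train_main`: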

In [None]:
# Train both feature types for comparison
import os
import shutil

COMPARE_BOTH = False  # Set to True to train both feature types

if COMPARE_BOTH:
    print("🔄 Training both feature types for comparison...")

    feature_types = ["mean_pooled", "last_vector"]

    for ft in feature_types:
        print(f"\n🚀 Training with {ft} features...")
        print("=" * 50)

        # Train model (train_main is imported in the training cell above)
        train_main(
            feature_type=ft,
            data_path=DATA_PATH,
            batch_size=BATCH_SIZE,
            num_epochs=NUM_EPOCHS
        )

        # If outputs were written to the local probe/ directory, rename them
        # so the next run does not overwrite them. Skipped when a file is absent.
        for name, ext in [("best_probe_model", "pth"), ("training_history", "pkl")]:
            src = f"probe/{name}.{ext}"
            if os.path.exists(src):
                shutil.move(src, f"probe/{name}_{ft}.{ext}")

    print("\n🎉 Both models trained! Check probe/ and /content/drive/MyDrive/probes/<feature_type>/ for outputs.")
else:
    print("ℹ️  Set COMPARE_BOTH = True to train both feature types")