# VacuaGym Quickstart Tutorial

This notebook walks through the core VacuaGym workflow:
1. Loading datasets
2. Exploring features
3. Examining stability labels
4. Loading benchmark splits
5. Training a baseline model
6. Inspecting feature importance

**Prerequisites**: Run the Phase 1-3 scripts to generate the data files before executing this notebook.
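
Before running the cells below, it can help to confirm that the generated files are in place. The following is a minimal sketch: `find_missing_files` and `EXPECTED_FILES` are hypothetical names introduced here (not part of VacuaGym), and the paths are simply the ones this notebook reads later.

```python
from pathlib import Path

# Files this notebook reads, relative to the repository root.
EXPECTED_FILES = [
    'data/processed/tables/ks_features.parquet',
    'data/processed/tables/cicy3_features.parquet',
    'data/processed/tables/fth6d_graph_features.parquet',
    'data/processed/labels/toy_eft_stability.parquet',
    'data/processed/splits/iid_split.json',
    'data/processed/splits/ood_complexity_split.json',
]

def find_missing_files(root, relative_paths=EXPECTED_FILES):
    """Return the relative paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]

# The notebook is assumed to live one level below the repository root.
missing = find_missing_files('..')
if missing:
    print("Missing files (run the Phase 1-3 scripts first):")
    for p in missing:
        print(f"  {p}")
else:
    print("All expected data files found.")
```

If anything is listed as missing, the loading cells further down will fail with a file-not-found error.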

In [None]:
import sys
sys.path.append('..')  # Add parent directory to path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Load Datasets

VacuaGym provides three main datasets:
- **KS**: Kreuzer-Skarke reflexive polytopes
- **CICY**: Complete intersection Calabi-Yau threefolds
- **F-theory**: Toric base surfaces

In [None]:
# Load features
ks_features = pd.read_parquet('../data/processed/tables/ks_features.parquet')
cicy_features = pd.read_parquet('../data/processed/tables/cicy3_features.parquet')
fth_features = pd.read_parquet('../data/processed/tables/fth6d_graph_features.parquet')

print(f"KS polytopes: {len(ks_features):,}")
print(f"CICY configs: {len(cicy_features):,}")
print(f"F-theory bases: {len(fth_features):,}")

In [None]:
# Load stability labels
labels = pd.read_parquet('../data/processed/labels/toy_eft_stability.parquet')
print(f"Total labeled geometries: {len(labels):,}")

# Label distribution
print("\nStability distribution:")
print(labels['stability'].value_counts())

## 2. Explore Features

Let's visualize the Hodge numbers $h^{1,1}$ and $h^{2,1}$ of the CICY threefolds.

In [None]:
# CICY Hodge number distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(cicy_features['h11'], bins=50, alpha=0.7, label=r'$h^{1,1}$')
axes[0].hist(cicy_features['h21'], bins=50, alpha=0.7, label=r'$h^{2,1}$')
axes[0].set_xlabel('Hodge Number')
axes[0].set_ylabel('Count')
axes[0].set_title('CICY Hodge Number Distribution')
axes[0].legend()

axes[1].scatter(cicy_features['h11'], cicy_features['h21'], alpha=0.5, s=10)
axes[1].set_xlabel(r'$h^{1,1}$')
axes[1].set_ylabel(r'$h^{2,1}$')
axes[1].set_title(r'CICY Hodge Numbers: $h^{1,1}$ vs $h^{2,1}$')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
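
For a Calabi-Yau threefold, the Euler characteristic follows directly from the two Hodge numbers: $\chi = 2\,(h^{1,1} - h^{2,1})$. The sketch below derives it on a small hand-made table rather than on `cicy_features`; the column name `euler_number` is an assumption introduced for illustration, and the real feature table may already contain an equivalent column.

```python
import pandas as pd

# Two well-known examples: the quintic in P^4 and the bicubic in P^2 x P^2.
toy = pd.DataFrame({
    'name': ['quintic', 'bicubic'],
    'h11': [1, 2],
    'h21': [101, 83],
})

# Euler characteristic of a Calabi-Yau threefold: chi = 2 * (h11 - h21)
toy['euler_number'] = 2 * (toy['h11'] - toy['h21'])

print(toy)
```

The same one-line expression applies to any table that has `h11` and `h21` columns.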

## 3. Examine Stability Labels

These labels were generated with the toy EFT model. Each geometry carries a stability class such as `stable` or `unstable`.

In [None]:
# Plot stability distribution by dataset
fig, ax = plt.subplots(figsize=(10, 6))

stability_by_dataset = labels.groupby(['dataset', 'stability']).size().unstack(fill_value=0)
stability_by_dataset.plot(kind='bar', stacked=True, ax=ax)

ax.set_xlabel('Dataset')
ax.set_ylabel('Count')
ax.set_title('Stability Distribution by Dataset')
ax.legend(title='Stability')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
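
A stacked bar chart shows absolute counts, which can hide differences in class balance between datasets of very different sizes. As a rough sketch, the fraction of `stable` geometries per dataset can be computed with a grouped mean. The toy table below stands in for `labels`; the dataset name `'ks'` is an assumption (only `'cicy3'` appears elsewhere in this notebook).

```python
import pandas as pd

# Toy stand-in for the labels table (values are made up for illustration).
toy_labels = pd.DataFrame({
    'dataset':   ['cicy3', 'cicy3', 'ks', 'ks', 'ks', 'ks'],
    'stability': ['stable', 'unstable', 'stable', 'stable', 'stable', 'unstable'],
})

# The mean of a boolean mask within each group is the fraction of True values.
stable_fraction = (toy_labels['stability'] == 'stable').groupby(toy_labels['dataset']).mean()

print(stable_fraction)
```

Strongly skewed fractions are one reason to report macro-averaged F1 alongside accuracy when training models.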

In [None]:
# Examine potential values for stable vs unstable geometries
stable_labels = labels[labels['stability'] == 'stable']
unstable_labels = labels[labels['stability'] == 'unstable']

fig, ax = plt.subplots(figsize=(10, 6))

if len(stable_labels) > 0:
    ax.hist(stable_labels['potential_value'], bins=50, alpha=0.6, label='Stable', density=True)
if len(unstable_labels) > 0:
    ax.hist(unstable_labels['potential_value'], bins=50, alpha=0.6, label='Unstable', density=True)

ax.set_xlabel('Potential Value V(φ*)')
ax.set_ylabel('Density')
ax.set_title('Potential Value Distribution by Stability')
ax.legend()
plt.tight_layout()
plt.show()

## 4. Load Benchmark Splits

VacuaGym provides IID (in-distribution) and OOD (out-of-distribution) splits for benchmarking.

In [None]:
import json

# Load IID split
with open('../data/processed/splits/iid_split.json', 'r') as f:
    iid_split = json.load(f)

print("IID Split:")
print(f"  Train: {iid_split['train_size']:,}")
print(f"  Val:   {iid_split['val_size']:,}")
print(f"  Test:  {iid_split['test_size']:,}")

# Load OOD split
with open('../data/processed/splits/ood_complexity_split.json', 'r') as f:
    ood_split = json.load(f)

print("\nOOD Complexity Split:")
print(f"  Train: {ood_split['train_size']:,}")
print(f"  Val:   {ood_split['val_size']:,}")
print(f"  Test:  {ood_split['test_size']:,}")
print(f"  Threshold: {ood_split.get('complexity_threshold', 'N/A')}")
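
Before training on a split, it is worth checking that no sample appears in more than one partition. The sketch below assumes that each split file stores its partitions as lists of integer indices under the keys `'train'`, `'val'`, and `'test'`; only `'train'` and `'test'` are used that way in this notebook, so the `'val'` key is an assumption. `check_split_disjoint` is a hypothetical helper, not a VacuaGym function.

```python
def check_split_disjoint(split, keys=('train', 'val', 'test')):
    """Return True if no index appears in more than one partition."""
    seen = set()
    for key in keys:
        part = set(split.get(key, []))
        if seen & part:
            return False  # at least one index is shared with an earlier partition
        seen |= part
    return True

# Toy splits for illustration; the second one leaks index 2 from train into val.
clean_split = {'train': [0, 1, 2, 3], 'val': [4, 5], 'test': [6, 7]}
leaky_split = {'train': [0, 1, 2], 'val': [2, 3], 'test': [4]}

print(check_split_disjoint(clean_split))
print(check_split_disjoint(leaky_split))
```

Calling `check_split_disjoint(iid_split)` or `check_split_disjoint(ood_split)` would apply the same check to the loaded files, provided they follow the assumed layout.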

## 5. Train a Simple Model

Let's train a quick Random Forest baseline on the CICY dataset.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Merge CICY features with labels
df = cicy_features.merge(labels[labels['dataset'] == 'cicy3'], 
                         left_on='cicy_id', 
                         right_on='geometry_id',
                         how='inner')

print(f"Merged dataset: {len(df)} samples")

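# Select numeric columns from the feature table only, so that label-side
# columns such as 'potential_value' do not leak into the model inputs
feature_cols = [col for col in cicy_features.select_dtypes(include='number').columns
                if col in df.columns and col != 'cicy_id']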

X = df[feature_cols].fillna(0).values
y = df['stability'].values

# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(f"Features: {len(feature_cols)}")
print(f"Classes: {le.classes_}")

In [None]:
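# Split data using the IID split.
# Assumption: the split lists hold integer row positions. Positions outside the
# merged CICY frame are dropped (slicing the list to len(X) does not bound them).
train_idx = np.array([i for i in iid_split['train'] if i < len(X)], dtype=int)
test_idx = np.array([i for i in iid_split['test'] if i < len(X)], dtype=int)
X_train, y_train = X[train_idx], y_encoded[train_idx]
X_test, y_test = X[test_idx], y_encoded[test_idx]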

# Train Random Forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Test F1 Score: {f1:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))
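
Accuracy alone can look good on imbalanced labels, so it helps to compare against a trivial reference. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data with 20% positives; the arrays are made up for illustration and are not VacuaGym data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic, deterministic data: every fifth sample is positive (20%).
X_toy = np.zeros((200, 3))
y_toy = (np.arange(200) % 5 == 0).astype(int)

X_tr, y_tr = X_toy[:150], y_toy[:150]
X_te, y_te = X_toy[150:], y_toy[150:]

# Always predicts the most frequent class seen during fit.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_tr, y_tr)
y_dummy = dummy.predict(X_te)

dummy_acc = accuracy_score(y_te, y_dummy)
dummy_f1 = f1_score(y_te, y_dummy, average='macro', zero_division=0)

print(f"Majority-class accuracy: {dummy_acc:.4f}")
print(f"Majority-class macro F1: {dummy_f1:.4f}")
```

A trained model is only informative to the extent that it beats these numbers; macro F1 penalizes the trivial predictor much more than accuracy does.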

## 6. Feature Importance

Which features are most important for stability prediction?

In [None]:
# Get feature importances (impurity-based, from the fitted Random Forest)
importances = clf.feature_importances_
top_k = min(20, len(feature_cols))  # there may be fewer than 20 features
indices = np.argsort(importances)[-top_k:]  # ascending, so the largest bar is on top

plt.figure(figsize=(10, 8))
plt.barh(range(len(indices)), importances[indices])
plt.yticks(range(len(indices)), [feature_cols[i] for i in indices])
plt.xlabel('Feature Importance')
plt.title(f'Top {top_k} Most Important Features')
plt.tight_layout()
plt.show()
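
Impurity-based importances are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The following is a sketch on synthetic data (not the CICY features), using scikit-learn's `permutation_importance`; only feature 0 carries signal by construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)  # label depends on feature 0 only

clf_toy = RandomForestClassifier(n_estimators=50, random_state=0)
clf_toy.fit(X_toy[:200], y_toy[:200])

# Shuffle each feature in turn on held-out rows and measure the score drop.
result = permutation_importance(
    clf_toy, X_toy[200:], y_toy[200:], n_repeats=10, random_state=0
)

ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

To apply this to the model above, pass `clf`, `X_test`, and `y_test` in place of the toy objects.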

## Next Steps

1. Explore other datasets (KS, F-theory)
2. Try different models (see `scripts/50_train_baseline_tabular.py`)
3. Run the active learning loop (`scripts/60_active_learning_scan.py`)
4. Customize the toy EFT potential parameters
5. Add your own features and physics constraints
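
As a rough illustration of what an active learning loop (item 3) does, the sketch below implements pool-based uncertainty sampling on synthetic data. It is not the method used in `scripts/60_active_learning_scan.py`, whose details are not shown here; the data, batch size, and number of rounds are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 4))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # labels revealed only when queried

# Seed the labeled set with five examples of each class.
labeled = [int(i) for i in np.where(y_pool == 0)[0][:5]]
labeled += [int(i) for i in np.where(y_pool == 1)[0][:5]]
labeled_set = set(labeled)
unlabeled = [i for i in range(len(X_pool)) if i not in labeled_set]

for round_idx in range(5):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_pool[labeled], y_pool[labeled])

    # Uncertainty = 1 - probability of the most likely class.
    proba = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)

    # Query the 10 most uncertain pool points and move them to the labeled set.
    query_pos = np.argsort(uncertainty)[-10:]
    queried = [unlabeled[p] for p in query_pos]
    labeled.extend(queried)
    queried_set = set(queried)
    unlabeled = [i for i in unlabeled if i not in queried_set]

    print(f"Round {round_idx + 1}: {len(labeled)} labeled, {len(unlabeled)} in pool")
```

In a physics setting, "querying a label" would correspond to running the expensive stability computation for the selected geometries.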

See the documentation for more details!