# EQM Training for Darcy Flow Neural Operator

This notebook trains an Equilibrium Matching model for Darcy flow PDEs using **unconditional generation**.

**What this does:** Learns to generate solution fields u(x,y) from random noise, without using input permeability fields a(x,y).
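
For orientation, steady-state Darcy flow is commonly posed as the elliptic problem below, where a(x,y) is the permeability field and u(x,y) is the solution field. The unit-square domain, the forcing term f, and the zero boundary condition are the usual benchmark setup and are assumptions here; this notebook does not state how the HDF5 file was generated.

$$
-\nabla \cdot \big( a(x,y)\, \nabla u(x,y) \big) = f(x,y), \quad (x,y) \in (0,1)^2, \qquad u = 0 \ \text{on the boundary}
$$

Unconditional generation means the model samples plausible u fields directly and never sees the matching a.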

## Before Running:
1. **Enable GPU**: Runtime → Change runtime type → GPU
2. **Upload HDF5**: Upload `2D_DarcyFlow_beta1.0_Train.hdf5` to Google Drive
3. **Update path**: Change `DRIVE_DATA_PATH` in Step 2

## Recommended Workflow:
1. Run Steps 1-4 (setup, verification, and configuration)
2. **Run Step 5 (TensorBoard) and leave its panel open**
3. Run Step 6 (start training)
4. Monitor progress in TensorBoard while training runs
5. Sample from the trained model (Step 7) after training completes
6. Save results (Steps 8-9) after training completes

## Step 1: Setup - Clone Repository and Install Dependencies

In [None]:
# Clone repository
!git clone https://github.com/MehdiMHeydari/EQM-Training.git
%cd EQM-Training

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

print("\n" + "="*60)
print("Installing dependencies... (this may take 3-5 minutes)")
print("="*60)

In [None]:
# Install dependencies
!pip install -q torch torchvision
!pip install -q h5py einops omegaconf tensorboard POT
!pip install -q -e .

print("\n✅ All dependencies installed!")

## Step 2: Copy Data from Google Drive

In [None]:
import os
import shutil

# ⚠️ CHANGE THIS PATH to match your Google Drive location!
DRIVE_DATA_PATH = "/content/drive/MyDrive/2D_DarcyFlow_beta1.0_Train.hdf5"

# Local path (don't change)
LOCAL_DATA_PATH = "data/2D_DarcyFlow_beta1.0_Train.hdf5"

print("Copying data from Google Drive...")
print(f"Source: {DRIVE_DATA_PATH}")
print(f"Destination: {LOCAL_DATA_PATH}")

if os.path.exists(DRIVE_DATA_PATH):
    os.makedirs("data", exist_ok=True)
    shutil.copy(DRIVE_DATA_PATH, LOCAL_DATA_PATH)
    
    # Verify
    size_mb = os.path.getsize(LOCAL_DATA_PATH) / (1024**2)
    print(f"\n✅ Data copied successfully!")
    print(f"   File size: {size_mb:.2f} MB")
else:
    print(f"\n❌ ERROR: File not found at {DRIVE_DATA_PATH}")
    print("Please update DRIVE_DATA_PATH in this cell!")
    raise FileNotFoundError(f"Data file not found: {DRIVE_DATA_PATH}")
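
Re-running the copy cell repeats the full Drive-to-local transfer every time. A minimal sketch of a guard that skips the copy when the local file already has the same size as the source is shown below. The helper name `copy_if_changed` is hypothetical and not part of the repository, and matching sizes are only a cheap heuristic, not proof that the file is intact.

```python
import os
import shutil


def copy_if_changed(src: str, dst: str) -> bool:
    """Copy src to dst unless dst already exists with the same byte size.

    Returns True if a copy was performed, False if it was skipped.
    Hypothetical helper: equal size is a heuristic, not a checksum.
    """
    if not os.path.exists(src):
        raise FileNotFoundError(f"Data file not found: {src}")
    if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
        return False  # same size: assume the earlier copy is still usable
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    shutil.copy(src, dst)
    return True
```

In the cell above, it could be called as `copy_if_changed(DRIVE_DATA_PATH, LOCAL_DATA_PATH)` in place of the unconditional `shutil.copy`.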

## Step 3: Verify Setup

In [None]:
# Test imports
print("Testing imports...")
from physics_flow_matching.utils.dataset import DarcyFlow
from physics_flow_matching.unet.unet_bb import UNetModelWrapper
from torchcfm.conditional_flow_matching import EquilibriumMatching
import torch

print("✅ All imports successful!")

# Test GPU
if torch.cuda.is_available():
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️  WARNING: No GPU detected!")
    print("   Go to: Runtime → Change runtime type → GPU")

# Test dataset loading with UNCONDITIONAL format
print("\nTesting dataset...")
dataset = DarcyFlow(
    hdf5_path="data/2D_DarcyFlow_beta1.0_Train.hdf5",
    normalize=True,
    use_eqm_format=True  # Unconditional: x0=empty, x1=u(x,y) only
)

print(f"✅ Dataset loaded: {len(dataset)} samples")
print(f"   Sample shape: {dataset.shape}")

# Test sample
x0, x1 = dataset[0]
print(f"   x0 (unused) shape: {x0.shape}")
print(f"   x1 (output u) shape: {x1.shape}")
print(f"   x1 stats: min={x1.min():.3f}, max={x1.max():.3f}, mean={x1.mean():.3f}")

print("\n🎉 Everything is ready for training!")
print("Model will learn: noise → u(x,y) (unconditional generation)")

## Step 4: View/Modify Configuration (Optional)

In [None]:
# View current config
!cat configs/darcy_flow_eqm.yaml

In [None]:
# Configure training settings
from omegaconf import OmegaConf

config = OmegaConf.load("configs/darcy_flow_eqm.yaml")

# Training configuration
config.device = "cuda"  # Use GPU
config.dataloader.batch_size = 32  # Adjust if needed (reduce to 16 or 8 if OOM)
config.num_epochs = 50  # Total epochs to train
config.save_epoch_int = 5  # Save checkpoint every 5 epochs
config.print_epoch_int = 1  # Print loss every epoch

# Auto-backup to Google Drive
config.drive_backup_path = "/content/drive/MyDrive/EQM_Checkpoints"  # Checkpoints saved here automatically

# Save modified config
OmegaConf.save(config, "configs/darcy_flow_eqm.yaml")

print("✅ Config updated!")
print(f"   Device: {config.device}")
print(f"   Batch size: {config.dataloader.batch_size}")
print(f"   Total epochs: {config.num_epochs}")
print(f"   Save interval: Every {config.save_epoch_int} epochs")
print(f"   Auto-backup to: {config.drive_backup_path}")
print(f"\n📁 Checkpoints will be automatically saved to Google Drive during training!")

## Step 5: Launch TensorBoard (BEFORE Training)

**⚠️ IMPORTANT**: Run this cell FIRST, then run the training cell below.

The TensorBoard panel stays open below this cell and displays training metrics as they are logged.
You can scroll down and start training while TensorBoard stays open.

In [None]:
# Launch TensorBoard (run this BEFORE training)
%load_ext tensorboard
%tensorboard --logdir experiments/darcy_flow_eqm

print("\n📊 TensorBoard is running!")
print("Scroll down and run the next cell to start training.")
print("Training metrics will appear here in real-time.")

## Step 6: Start Training! 🚀

**Run this cell AFTER starting TensorBoard above.**

Training will run for 50 epochs, automatically saving checkpoints to Google Drive every 5 epochs.

**Features:**
- ✅ Checkpoints auto-saved to `/content/drive/MyDrive/EQM_Checkpoints/` every 5 epochs
- ✅ TensorBoard metrics update in real-time
- ✅ No need to run separate backup cell - it's automatic!
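
To know which checkpoint files to look for later, the save schedule can be worked out from the two config values. The sketch below assumes 1-based epoch numbering and a save whenever the epoch number is a multiple of `save_epoch_int`. Both are assumptions about the training script, which this notebook does not confirm, and `expected_checkpoint_epochs` is a hypothetical helper.

```python
def expected_checkpoint_epochs(num_epochs: int, save_epoch_int: int) -> list[int]:
    """List the epochs at which a checkpoint is expected.

    Assumes epochs are numbered from 1 and a checkpoint is written whenever
    the epoch number is a multiple of save_epoch_int (an assumption, not
    confirmed behaviour of the training script).
    """
    if num_epochs < 0 or save_epoch_int <= 0:
        raise ValueError("num_epochs must be >= 0 and save_epoch_int must be > 0")
    return list(range(save_epoch_int, num_epochs + 1, save_epoch_int))


epochs = expected_checkpoint_epochs(num_epochs=50, save_epoch_int=5)
print(f"{len(epochs)} checkpoints expected, first at epoch {epochs[0]}, last at epoch {epochs[-1]}")
```

Under these assumptions the settings from Step 4 give ten checkpoints; if the files in `EQM_Checkpoints` are numbered differently, trust the files.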

**Progress**: Watch TensorBoard above for training metrics!

In [None]:
# Start training
print("🚀 Starting training...")
print("Monitor progress in TensorBoard above!\n")

!python physics_flow_matching/train_scripts/train_unet_eqm.py configs/darcy_flow_eqm.yaml

print("\n✅ Training complete!")

## Step 7: Generate Samples from Trained Model (Unconditional)

After training completes, use the trained model to generate solution fields u(x,y) from random noise.

**What happens:**
1. Starts from random Gaussian noise
2. Follows energy gradients: dx/dτ = -∇E(x)
3. Converges to realistic solution fields u(x,y)
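
The steps above amount to an explicit Euler discretisation of the gradient flow: x ← x - η∇E(x), repeated `NUM_STEPS` times with η = `STEP_SIZE`. As a minimal sketch of that update rule, the example below replaces the learned energy with a toy quadratic, E(x) = 0.5·Σ(xᵢ - targetᵢ)², whose gradient is known in closed form. The toy energy, the `target` vector, and all function names are illustrative assumptions; the trained model's energy landscape will behave differently.

```python
def grad_energy(x, target):
    """Gradient of the toy energy E(x) = 0.5 * sum((x_i - target_i)^2)."""
    return [xi - ti for xi, ti in zip(x, target)]


def descend(x, target, num_steps=100, step_size=0.01):
    """Explicit Euler steps for dx/dtau = -grad E(x)."""
    for _ in range(num_steps):
        g = grad_energy(x, target)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


def distance(x, target):
    """Euclidean distance between x and target."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target)) ** 0.5


target = [1.0, -2.0, 0.5]  # stands in for a realistic field (toy assumption)
x0 = [0.3, 0.8, -1.1]      # stands in for the Gaussian noise start
x_final = descend(x0, target)
ratio = distance(x_final, target) / distance(x0, target)
print(f"remaining distance fraction: {ratio:.3f}")
```

For this toy energy each step shrinks the distance to the minimum by a factor of (1 - η), so 100 steps of size 0.01 leave roughly 37% of the initial distance. That suggests tuning `NUM_STEPS` and `STEP_SIZE` together if samples look under-converged.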

**Note:** Set `CHECKPOINT` in the cell below to your actual checkpoint file (the default points to `checkpoint_25.pth`)!

In [None]:
# Generate unconditional samples from noise
import numpy as np
import matplotlib.pyplot as plt
import torch
from tqdm import tqdm
from omegaconf import OmegaConf

# Ensure we're in the right directory
import os
os.chdir('/content/EQM-Training')

# Import from installed package
from physics_flow_matching.unet.unet_bb import UNetModelWrapper as UNetModel

# Sampling parameters
CHECKPOINT = "/content/drive/MyDrive/EQM_Checkpoints/checkpoint_25.pth"  # Change this!
NUM_SAMPLES = 16
NUM_STEPS = 100  # Gradient descent steps
STEP_SIZE = 0.01  # Step size for gradient descent

print(f"Generating {NUM_SAMPLES} unconditional samples...")
print(f"Checkpoint: {CHECKPOINT}")
print(f"Gradient descent: {NUM_STEPS} steps with step_size={STEP_SIZE}\n")

# Verify checkpoint exists
if not os.path.exists(CHECKPOINT):
    print(f"❌ ERROR: Checkpoint not found at {CHECKPOINT}")
    print("\nAvailable checkpoints in Google Drive:")
    ckpt_dir = "/content/drive/MyDrive/EQM_Checkpoints"
    if os.path.exists(ckpt_dir):
        checkpoints = [f for f in os.listdir(ckpt_dir) if f.endswith('.pth')]
        for ckpt in sorted(checkpoints):
            print(f"   - {os.path.join(ckpt_dir, ckpt)}")
    raise FileNotFoundError(f"Checkpoint not found: {CHECKPOINT}")

# Load config
config = OmegaConf.load("configs/darcy_flow_eqm.yaml")

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Initialize model
print("Initializing model...")
model = UNetModel(
    dim=config.unet.dim,
    out_channels=config.unet.out_channels,
    num_channels=config.unet.num_channels,
    num_res_blocks=config.unet.res_blocks,
    channel_mult=config.unet.channel_mult,
    num_head_channels=config.unet.head_chans,
    attention_resolutions=config.unet.attn_res,
    dropout=config.unet.dropout,
    use_new_attention_order=config.unet.new_attn,
    use_scale_shift_norm=config.unet.film,
)

# Load checkpoint
print(f"Loading checkpoint...")
checkpoint = torch.load(CHECKPOINT, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

# Sample shape from config
sample_shape = tuple(config.unet.dim)  # (C, H, W)
print(f"Sample shape: {sample_shape}")

# Generate samples using gradient descent
print(f"\nGenerating {NUM_SAMPLES} samples...")
all_samples = []
batch_size = 16
num_batches = (NUM_SAMPLES + batch_size - 1) // batch_size

for i in tqdm(range(num_batches), desc="Generating samples"):
    current_batch_size = min(batch_size, NUM_SAMPLES - i * batch_size)
    
    # Start from random Gaussian noise
    x = torch.randn(current_batch_size, *sample_shape).to(device)
    
    # Gradient descent loop
    for step in range(NUM_STEPS):
        x.requires_grad_(True)
        
        with torch.enable_grad():
            # Compute energy E(x) = sum(x * model(x))
            pred = model(x)
            E = torch.sum(x * pred, dim=(1, 2, 3))
            
            # Compute gradient v = -∇E(x)
            grad = -torch.autograd.grad([E.sum()], [x], create_graph=False)[0]
        
        # Update x (gradient descent step)
        with torch.no_grad():
            x = x + STEP_SIZE * grad
    
    # Save final samples
    all_samples.append(x.detach().cpu().numpy())

samples = np.concatenate(all_samples, axis=0)[:NUM_SAMPLES]

# Save samples
np.save("samples_unconditional.npy", samples)
print(f"\n✅ Generated {samples.shape[0]} samples!")
print(f"   Shape: {samples.shape}")
print(f"   Stats: min={samples.min():.3f}, max={samples.max():.3f}, mean={samples.mean():.3f}")

# Visualize samples
print("\nVisualizing samples...")
fig, axes = plt.subplots(4, 4, figsize=(12, 12))
for i, ax in enumerate(axes.flat):
    if i < len(samples):
        im = ax.imshow(samples[i, 0], cmap='viridis')
        ax.set_title(f'Sample {i}')
        ax.axis('off')
        plt.colorbar(im, ax=ax, fraction=0.046)

plt.tight_layout()
plt.savefig('unconditional_samples.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n✅ Samples saved to: samples_unconditional.npy")
print(f"   Visualization: unconditional_samples.png")
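
The sampling cell above splits `NUM_SAMPLES` into batches using ceiling division, with a smaller final batch when the count is not a multiple of the batch size. The arithmetic can be checked in isolation with the sketch below; `batch_sizes` is a hypothetical helper that mirrors the two expressions used in the loop.

```python
def batch_sizes(num_samples: int, batch_size: int) -> list[int]:
    """Return the size of each batch needed to cover num_samples."""
    num_batches = (num_samples + batch_size - 1) // batch_size  # ceiling division
    return [min(batch_size, num_samples - i * batch_size) for i in range(num_batches)]


print(batch_sizes(16, 16))  # [16]
print(batch_sizes(40, 16))  # [16, 16, 8]
```

Note that the 4×4 plotting grid shows at most the first 16 samples, so raising `NUM_SAMPLES` changes what is saved to `samples_unconditional.npy` but not how many panels are drawn.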

## Step 8: Save Results to Google Drive

In [None]:
import shutil
import os

# Paths
experiment_path = "experiments/darcy_flow_eqm"
drive_save_path = "/content/drive/MyDrive/EQM_Experiments"

if os.path.exists(experiment_path):
    print(f"Copying experiment to Google Drive...")
    print(f"Destination: {drive_save_path}")
    
    # Copy entire experiment folder
    shutil.copytree(experiment_path, drive_save_path, dirs_exist_ok=True)
    
    print(f"\n✅ Experiment saved to Google Drive!")
    print(f"   Location: {drive_save_path}")
    
    # List saved checkpoints
    checkpoint_dir = os.path.join(drive_save_path, "exp_1/saved_state")
    if os.path.exists(checkpoint_dir):
        checkpoints = [f for f in os.listdir(checkpoint_dir) if f.endswith('.pth')]
        print(f"\n   Saved checkpoints ({len(checkpoints)}):")
        for ckpt in sorted(checkpoints):
            print(f"      - {ckpt}")
else:
    print("❌ No experiment folder found!")
    print("   Make sure training has started.")

## Step 9: Download Checkpoints (Alternative to Drive)

In [None]:
# Zip and download checkpoints
from google.colab import files

!zip -r checkpoints.zip experiments/darcy_flow_eqm/exp_1/saved_state/
files.download('checkpoints.zip')

print("✅ Checkpoints zipped and downloading...")

---

## 📝 Workflow Summary

### Correct Order:
1. ✅ Steps 1-4: Setup and configuration
2. ✅ **Step 5: Launch TensorBoard** (leave its panel open)
3. ✅ **Step 6: Start training** (runs while TensorBoard displays metrics)
4. ✅ Step 7: Generate samples after training completes
5. ✅ Steps 8-9: Save results after training completes

### Tips:
- **TensorBoard refreshes automatically** - just scroll up to check progress
- **Training output appears in the Step 6 cell** - you'll see epoch updates there
- **TensorBoard stays live during training** - the panel keeps updating while the training cell runs
- **Checkpoints save automatically** - every 5 epochs with the Step 4 settings

---

## Troubleshooting

### Out of Memory (OOM)
Reduce batch size in Step 4:
```python
config.dataloader.batch_size = 16  # or 8, or 4
```

### Training too slow
Check GPU is enabled:
```python
!nvidia-smi
```

### Runtime disconnected
Resume training:
1. Re-run Steps 1-2 (clone, install, copy data)
2. Modify config: `config.restart = True`, `config.restart_epoch = <last_epoch>`
3. Re-run training
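
To find a value for `<last_epoch>`, the backed-up checkpoint files can be scanned for the highest epoch number. The sketch below assumes files are named `checkpoint_<N>.pth`, as in the Step 7 example path. `latest_checkpoint` is a hypothetical helper, and whether `config.restart_epoch` should equal that number exactly depends on the training script. Epochs are compared numerically because a plain string sort would place `checkpoint_10.pth` before `checkpoint_5.pth`.

```python
import os
import re

# Assumed naming scheme, based on the example path used in Step 7.
CKPT_PATTERN = re.compile(r"^checkpoint_(\d+)\.pth$")


def latest_checkpoint(ckpt_dir: str):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    best = None
    for name in os.listdir(ckpt_dir):
        match = CKPT_PATTERN.match(name)
        if match:
            epoch = int(match.group(1))
            if best is None or epoch > best[0]:
                best = (epoch, os.path.join(ckpt_dir, name))
    return best


ckpt_dir = "/content/drive/MyDrive/EQM_Checkpoints"
if os.path.isdir(ckpt_dir):
    print(latest_checkpoint(ckpt_dir))
```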

### TensorBoard shows "No dashboards active"
Wait until the first epoch finishes after starting training; metrics appear only once the first values are logged.
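
If the message persists, it can help to check whether any event files exist under the log directory yet. TensorBoard reads files whose names start with `events.out.tfevents.`. The sketch below searches for them recursively; `find_event_files` is a hypothetical helper, and the log directory is the one passed to `--logdir` in Step 5.

```python
from pathlib import Path


def find_event_files(logdir: str) -> list[Path]:
    """Return all TensorBoard event files found under logdir (recursively)."""
    root = Path(logdir)
    if not root.is_dir():
        return []  # nothing has been written yet, or the path is wrong
    return sorted(root.rglob("events.out.tfevents.*"))


logdir = "experiments/darcy_flow_eqm"
files = find_event_files(logdir)
print(f"{len(files)} event file(s) under {logdir}")
```

Zero files usually means training has not logged anything yet or is writing to a different directory than the one TensorBoard was pointed at.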

### Need help?
Check `COLAB_SETUP.md` for detailed troubleshooting guide.