# EQM Training for Darcy Flow Neural Operator

This notebook trains an Equilibrium Matching model to learn the mapping from permeability fields a(x,y) to solution fields u(x,y) for Darcy flow PDEs.

**What this does:** Learns a conditional flow from input a(x,y) → output u(x,y) using paired data.

## Before Running:
1. **Enable GPU**: Runtime → Change runtime type → GPU
2. **Upload HDF5**: Upload `2D_DarcyFlow_beta1.0_Train.hdf5` to Google Drive
3. **Update path**: Change `DRIVE_DATA_PATH` in Step 2

## Recommended Workflow:
1. Run Steps 1-4 (setup, data copy, verification, and optional configuration)
2. **Run Step 5 (TensorBoard) before training.** The cell returns right away; the dashboard stays embedded in its output.
3. Run Step 6 (start training)
4. Monitor progress in TensorBoard while training runs
5. Save results (Steps 7-8) after training completes

## Step 1: Setup - Clone Repository and Install Dependencies

In [None]:
# Clone repository
!git clone https://github.com/MehdiMHeydari/EQM-Training.git
%cd EQM-Training

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

print("\n" + "="*60)
print("Installing dependencies... (this may take 3-5 minutes)")
print("="*60)

In [None]:
# Install dependencies
!pip install -q torch torchvision
!pip install -q h5py einops omegaconf tensorboard POT
!pip install -q -e .

print("\n✅ All dependencies installed!")
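
Before moving on, it can help to confirm that the packages actually import in the current runtime. A minimal sketch, assuming only the standard library: `missing_modules` is a hypothetical helper, and the module list mirrors the `pip install` lines above (POT is imported as `ot`).

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names for the packages installed above (POT provides the `ot` module).
required = ["torch", "torchvision", "h5py", "einops", "omegaconf", "tensorboard", "ot"]
print("Missing modules:", missing_modules(required) or "none")
```

If anything is listed as missing, re-running the install cell is usually the first thing to try.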

## Step 2: Copy Data from Google Drive

In [None]:
import os
import shutil

# ⚠️ CHANGE THIS PATH to match your Google Drive location!
DRIVE_DATA_PATH = "/content/drive/MyDrive/2D_DarcyFlow_beta1.0_Train.hdf5"

# Local path (don't change)
LOCAL_DATA_PATH = "data/2D_DarcyFlow_beta1.0_Train.hdf5"

print("Copying data from Google Drive...")
print(f"Source: {DRIVE_DATA_PATH}")
print(f"Destination: {LOCAL_DATA_PATH}")

if os.path.exists(DRIVE_DATA_PATH):
    os.makedirs("data", exist_ok=True)
    shutil.copy(DRIVE_DATA_PATH, LOCAL_DATA_PATH)
    
    # Verify
    size_mb = os.path.getsize(LOCAL_DATA_PATH) / (1024**2)
    print("\n✅ Data copied successfully!")
    print(f"   File size: {size_mb:.2f} MB")
else:
    print(f"\n❌ ERROR: File not found at {DRIVE_DATA_PATH}")
    print("Please update DRIVE_DATA_PATH in this cell!")
    raise FileNotFoundError(f"Data file not found: {DRIVE_DATA_PATH}")
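
The size check above confirms that a file exists at the destination, not that its bytes match the source. If a copy from Drive ever seems truncated, comparing checksums is one option. A minimal sketch using only the standard library; `sha256_of` is a hypothetical helper, not part of the repository.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large HDF5 file is never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After the copy cell has run, `sha256_of(DRIVE_DATA_PATH) == sha256_of(LOCAL_DATA_PATH)` should evaluate to `True`. Reading the Drive copy a second time can take a while for large files.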

## Step 3: Verify Setup

In [None]:
# Test imports
print("Testing imports...")
from physics_flow_matching.utils.dataset import DarcyFlow
from physics_flow_matching.unet.unet_bb import UNetModelWrapper
from torchcfm.conditional_flow_matching import EquilibriumMatching
import torch

print("✅ All imports successful!")

# Test GPU
if torch.cuda.is_available():
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️  WARNING: No GPU detected!")
    print("   Go to: Runtime → Change runtime type → GPU")

# Test dataset loading with conditional format
print("\nTesting dataset...")
dataset = DarcyFlow(
    hdf5_path="data/2D_DarcyFlow_beta1.0_Train.hdf5",
    normalize=True,
    use_eqm_format=False  # Conditional: x0=a(x,y), x1=u(x,y)
)

print(f"✅ Dataset loaded: {len(dataset)} samples")
print(f"   Sample shape: {dataset.shape}")

# Test sample
x0, x1 = dataset[0]
print(f"   x0 (input a) shape: {x0.shape}")
print(f"   x1 (output u) shape: {x1.shape}")
print(f"   x0 stats: min={x0.min():.3f}, max={x0.max():.3f}, mean={x0.mean():.3f}")
print(f"   x1 stats: min={x1.min():.3f}, max={x1.max():.3f}, mean={x1.mean():.3f}")

print("\n🎉 Everything is ready for training!")
print("Model will learn: a(x,y) → u(x,y)")

## Step 4: View/Modify Configuration (Optional)

In [None]:
# View current config
!cat configs/darcy_flow_eqm.yaml

In [None]:
# Optional: Modify config for Colab
from omegaconf import OmegaConf

config = OmegaConf.load("configs/darcy_flow_eqm.yaml")

# Adjust for Colab
config.device = "cuda"  # Use GPU
config.dataloader.batch_size = 32  # Adjust if needed
config.num_epochs = 100  # Change as desired
config.save_epoch_int = 10  # Save every 10 epochs

# Save modified config
OmegaConf.save(config, "configs/darcy_flow_eqm.yaml")
print("✅ Config updated!")
print(f"   Device: {config.device}")
print(f"   Batch size: {config.dataloader.batch_size}")
print(f"   Epochs: {config.num_epochs}")
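
To get a feel for what these settings imply, the number of optimizer steps and periodic checkpoints can be estimated with simple arithmetic. A rough sketch: `training_budget` is a hypothetical helper, and the sample count of 9000 is a made-up placeholder (use `len(dataset)` from Step 3). It assumes the last partial batch is kept and that one checkpoint is written every `save_epoch_int` epochs, which may not match the training script exactly.

```python
import math

def training_budget(num_samples, batch_size, num_epochs, save_epoch_int):
    """Estimate steps per epoch, total steps, and number of periodic checkpoints."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # assumes the last partial batch is kept
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
        "checkpoints": num_epochs // save_epoch_int,  # assumes one save per interval
    }

# 9000 is a placeholder sample count, not a property of the actual dataset.
print(training_budget(num_samples=9000, batch_size=32, num_epochs=100, save_epoch_int=10))
```

Halving the batch size roughly doubles `steps_per_epoch`, which is worth keeping in mind when lowering it to avoid out-of-memory errors.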

## Step 5: Launch TensorBoard (BEFORE Training)

**⚠️ IMPORTANT**: Run this cell FIRST, then run the training cell below.

The cell finishes almost immediately, but the TensorBoard server keeps running in the background and the dashboard stays embedded in this cell's output.
Scroll down and start training; metrics appear in the dashboard as they are logged.

In [None]:
# Launch TensorBoard (run this BEFORE training)
%load_ext tensorboard
%tensorboard --logdir experiments/darcy_flow_eqm

print("\n📊 TensorBoard is running!")
print("Scroll down and run the next cell to start training.")
print("Training metrics will appear here in real-time.")

## Step 6: Start Training! 🚀

**Run this cell AFTER starting TensorBoard above.**

Training will run here, while TensorBoard displays metrics above.

**Tip**: You can scroll between this cell and TensorBoard to monitor progress!

In [None]:
# Start training
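print("🚀 Starting training...")
print("Monitor progress in TensorBoard above!\n")

!python physics_flow_matching/train_scripts/train_unet_eqm.py configs/darcy_flow_eqm.yaml

# Shell commands run with `!` do not raise on failure; the exit status is stored in `_exit_code`
if _exit_code == 0:
    print("\n✅ Training complete!")
else:
    print(f"\n❌ Training exited with code {_exit_code}. Check the output above.")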

## Step 7: Save Results to Google Drive

In [None]:
import shutil
import os

# Paths
experiment_path = "experiments/darcy_flow_eqm"
drive_save_path = "/content/drive/MyDrive/EQM_Experiments"

if os.path.exists(experiment_path):
    print(f"Copying experiment to Google Drive...")
    print(f"Destination: {drive_save_path}")
    
    # Copy entire experiment folder
    shutil.copytree(experiment_path, drive_save_path, dirs_exist_ok=True)
    
    print("\n✅ Experiment saved to Google Drive!")
    print(f"   Location: {drive_save_path}")
    
    # List saved checkpoints
    checkpoint_dir = os.path.join(drive_save_path, "exp_1/saved_state")
    if os.path.exists(checkpoint_dir):
        checkpoints = [f for f in os.listdir(checkpoint_dir) if f.endswith('.pth')]
        print(f"\n   Saved checkpoints ({len(checkpoints)}):")
        for ckpt in sorted(checkpoints):
            print(f"      - {ckpt}")
else:
    print("❌ No experiment folder found!")
    print("   Make sure training has started.")

## Step 8: Download Checkpoints (Alternative to Drive)

In [None]:
# Zip and download checkpoints
from google.colab import files

!zip -r checkpoints.zip experiments/darcy_flow_eqm/exp_1/saved_state/
files.download('checkpoints.zip')

print("✅ Checkpoints zipped and downloading...")
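
If the `zip` command is unavailable, or you prefer to stay in Python, the standard library can build a similar archive. A minimal sketch: `zip_checkpoints` is a hypothetical helper, and the default directory is the path used in the cell above. Unlike `zip -r`, this stores the files at the archive root rather than under the full `experiments/...` path.

```python
import os
import shutil

def zip_checkpoints(checkpoint_dir="experiments/darcy_flow_eqm/exp_1/saved_state",
                    archive_base="checkpoints"):
    """Zip the contents of `checkpoint_dir` into `<archive_base>.zip` and return the archive path."""
    if not os.path.isdir(checkpoint_dir):
        raise FileNotFoundError(f"No checkpoint directory at {checkpoint_dir}")
    # make_archive appends the ".zip" extension itself.
    return shutil.make_archive(archive_base, "zip", root_dir=checkpoint_dir)
```

The returned path can then be passed to `files.download`.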

---

## 📝 Workflow Summary

### Correct Order:
1. ✅ Steps 1-4: Setup and configuration
2. ✅ **Step 5: Launch TensorBoard** (run before training)
3. ✅ **Step 6: Start training** (TensorBoard displays metrics as they are logged)
4. ✅ Steps 7-8: Save results after training completes

### Tips:
- **TensorBoard refreshes automatically** - just scroll up to check progress
- **Training output appears in Step 6** - you'll see epoch updates there
- **TensorBoard keeps serving in the background while the training cell runs** - this is the expected behavior
- **Checkpoints save automatically** - every `save_epoch_int` epochs (10 in the Step 4 config)

---

## Troubleshooting

### Out of Memory (OOM)
Reduce batch size in Step 4:
```python
config.dataloader.batch_size = 16  # or 8, or 4
```

### Training too slow
Check that the GPU is enabled:
```python
!nvidia-smi
```
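
The same check can be scripted so that a notebook can stop early when no GPU driver is visible. A minimal sketch using only the standard library; `nvidia_smi_ok` is a hypothetical helper.

```python
import shutil
import subprocess

def nvidia_smi_ok():
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    result = subprocess.run([exe], capture_output=True, text=True)
    return result.returncode == 0

print("GPU driver visible:", nvidia_smi_ok())
```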

### Runtime disconnected
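Resume training from the last saved checkpoint. If the runtime was reset, local files are gone, so first restore the `experiments/` folder from Google Drive if you saved it there.
1. Re-run Steps 1-3
2. Modify the config in Step 4: `config.restart = True`, `config.restart_epoch = <last_epoch>`
3. Re-run training (Step 6)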
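
To find a value for `<last_epoch>`, you can scan the checkpoint folder for the highest epoch number. A minimal sketch: `latest_checkpoint_epoch` is a hypothetical helper, and it assumes that each `.pth` filename ends with its epoch number (for example `checkpoint_40.pth`). That naming is a guess about the training script, so adjust the pattern to the files you actually see.

```python
import os
import re

def latest_checkpoint_epoch(checkpoint_dir):
    """Return the largest trailing integer among `.pth` filenames, or None if there is none."""
    pattern = re.compile(r"(\d+)\.pth$")  # assumed naming: <anything><epoch>.pth
    if not os.path.isdir(checkpoint_dir):
        return None
    epochs = [int(m.group(1)) for name in os.listdir(checkpoint_dir)
              if (m := pattern.search(name))]
    return max(epochs, default=None)

print(latest_checkpoint_epoch("experiments/darcy_flow_eqm/exp_1/saved_state"))
```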

### TensorBoard shows "No dashboards active"
Wait a few seconds after starting training - metrics appear after the first epoch.
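
If the message persists, it is worth checking whether any event files have been written yet. TensorBoard reads files whose names start with `events.out.tfevents`. A minimal sketch using only the standard library: `find_event_files` is a hypothetical helper, and the log directory is the one passed to `--logdir` in Step 5.

```python
from pathlib import Path

def find_event_files(logdir):
    """Return all TensorBoard event files under `logdir`, searched recursively."""
    root = Path(logdir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("events.out.tfevents.*"))

event_files = find_event_files("experiments/darcy_flow_eqm")
print(f"Found {len(event_files)} event file(s)")
```

An empty result means the training script has not yet created its log directory or written any metrics.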

### Need help?
Check `COLAB_SETUP.md` for a detailed troubleshooting guide.