# Vesuvius Challenge Surface Detection - Kaggle Training

**Optimized for Kaggle GPU (T4/P100 16GB)**

## 🚀 Usage:
1. Create a new notebook on Kaggle
2. Select Settings → Accelerator → **GPU T4 x2**
3. Add Data → add the Vesuvius Challenge dataset
4. Copy and paste this notebook
5. Click **"Save & Run All"** (Commit)
6. You can shut down your computer; Kaggle keeps running!

## ⏰ Duration:
- Training: ~3-4 hours (50 epochs)
- GPU Quota: 30 hours/week (free)
- Max session: 9 hours

## 💾 Output:
- Model checkpoints: `/kaggle/working/checkpoints/`
- Predictions: `/kaggle/working/outputs/`
- These are saved when the notebook is committed

## 1️⃣ Setup & Installation

In [None]:
%%time
# Clone repository
!git clone https://github.com/EmreUludasdemir/Vesuvius-Challenge-Surface-Detection.git
%cd Vesuvius-Challenge-Surface-Detection

print("✓ Repository cloned!")

In [None]:
%%time
# Install missing packages (Kaggle already has most)
!pip install -q segmentation-models-pytorch==0.3.3
!pip install -q monai
!pip install -q einops
!pip install -q omegaconf

print("✓ Dependencies installed!")

## 2️⃣ Environment Check

In [None]:
import os
import sys
import time
import psutil
import numpy as np
import torch
from pathlib import Path

# Add to path
sys.path.append('/kaggle/working/Vesuvius-Challenge-Surface-Detection')

print("="*60)
print("KAGGLE ENVIRONMENT INFO")
print("="*60)

# CPU & RAM
print(f"CPU cores: {psutil.cpu_count()}")
print(f"Total RAM: {psutil.virtual_memory().total / 1024**3:.2f} GB")
print(f"Available RAM: {psutil.virtual_memory().available / 1024**3:.2f} GB")

# GPU
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    
    # Memory benchmark
    gpu_mem_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if gpu_mem_total < 20:
        print("\n⚠️ Low memory GPU detected (T4/P100)")
        print("   Using optimized config: batch_size=2, features=16")
        GPU_CONFIG = 'low_memory'
    else:
        print("\n✓ High memory GPU detected (A100)")
        print("   Using default config: batch_size=4, features=32")
        GPU_CONFIG = 'default'
else:
    print("\n❌ No GPU available! Enable GPU in Settings.")
    GPU_CONFIG = 'cpu'

# Directories
print(f"\nWorking dir: {os.getcwd()}")
print(f"Kaggle input: /kaggle/input/")
print(f"Kaggle output: /kaggle/working/")

print("\n" + "="*60)
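The memory check above can also be written as a small pure function, which makes the cutoff easy to test without a GPU. This is a minimal sketch: the name `pick_gpu_config` is hypothetical, and the 20 GB threshold is copied from the cell above.

```python
from typing import Optional


def pick_gpu_config(total_mem_gb: Optional[float]) -> str:
    """Map GPU memory in GB to a config name; None means no GPU was found."""
    if total_mem_gb is None:
        return "cpu"
    # Same 20 GB cutoff as the environment check above.
    return "low_memory" if total_mem_gb < 20 else "default"


print(pick_gpu_config(15.9))  # a 16 GB card such as a T4 or P100
print(pick_gpu_config(40.0))  # a 40 GB card such as an A100
print(pick_gpu_config(None))  # no GPU attached
```

In the notebook, the argument would come from `torch.cuda.get_device_properties(0).total_memory / 1024**3`.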

## 3️⃣ Import Modules

In [None]:
from src.data.preprocessing import VolumeLoader, PatchExtractor
from src.data.augmentations import VolumeAugmentationPipeline, ZTranslationAugment
from src.models.sobel_baseline import SobelSurfaceDetector
from src.models.unet3d import UNet3DDepthInvariant, count_parameters
from src.training.losses import CombinedLoss
from src.training.trainer import Trainer, compute_dice, compute_iou

import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm

print("✓ All modules imported successfully!")

## 4️⃣ Configuration

In [None]:
# Kaggle-optimized configuration
CONFIG = {
    # Data paths - ADJUST THESE!
    'data_root': '/kaggle/input/vesuvius-challenge-ink-detection',  # Change to your dataset path
    'fragment_id': 'train/1',  # Fragment to train on
    
    # Model (optimized for 16GB GPU)
    'model': {
        'in_channels': 65,
        'out_channels': 1,
        'base_features': 16 if GPU_CONFIG == 'low_memory' else 32,
        'depth': 3 if GPU_CONFIG == 'low_memory' else 4,
    },
    
    # Training (optimized for Kaggle)
    'training': {
        'num_epochs': 50,
        'batch_size': 2 if GPU_CONFIG == 'low_memory' else 4,
        'learning_rate': 1e-4,
        'use_amp': True,  # Mixed precision - CRITICAL for memory
        'num_workers': 2,
        'save_every': 5,  # Save checkpoint every 5 epochs
    },
    
    # Data
    'data': {
        'num_slices': 65,
        'patch_size': 128 if GPU_CONFIG == 'low_memory' else 256,
        'stride': 64 if GPU_CONFIG == 'low_memory' else 128,
        'val_split': 0.2,
    },
    
    # Paths
    'checkpoint_dir': Path('/kaggle/working/checkpoints'),
    'output_dir': Path('/kaggle/working/outputs'),
}

# Create directories (parents=True also covers a missing parent folder)
CONFIG['checkpoint_dir'].mkdir(parents=True, exist_ok=True)
CONFIG['output_dir'].mkdir(parents=True, exist_ok=True)

print("Configuration:")
print(f"  GPU Config: {GPU_CONFIG}")
print(f"  Batch size: {CONFIG['training']['batch_size']}")
print(f"  Base features: {CONFIG['model']['base_features']}")
print(f"  Patch size: {CONFIG['data']['patch_size']}")
print(f"  Num epochs: {CONFIG['training']['num_epochs']}")
print(f"\n✓ Configuration ready!")
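The `patch_size` and `stride` settings determine how many patches a fragment yields, and therefore how long an epoch takes. The sketch below counts positions on a plain sliding-window grid. It assumes no padding and no mask filtering, so the repository's `PatchExtractor` may return a different number; the 1024 x 2048 fragment size is a made-up example.

```python
def patches_per_axis(length: int, patch_size: int, stride: int) -> int:
    """Number of window positions along one axis without padding."""
    if length < patch_size:
        return 0
    return (length - patch_size) // stride + 1


def grid_patch_count(height: int, width: int, patch_size: int, stride: int) -> int:
    """Total window positions on a 2D grid."""
    return (patches_per_axis(height, patch_size, stride)
            * patches_per_axis(width, patch_size, stride))


# Hypothetical 1024 x 2048 fragment with the two presets from CONFIG.
print(grid_patch_count(1024, 2048, patch_size=128, stride=64))   # 465
print(grid_patch_count(1024, 2048, patch_size=256, stride=128))  # 105
```

With a stride of half the patch size, neighbouring patches overlap by 50%, which is why the low-memory preset produces more (but smaller) patches.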

## 5️⃣ Data Loading

In [None]:
# Check if data exists
data_path = Path(CONFIG['data_root'])
fragment_path = data_path / CONFIG['fragment_id']

print(f"Looking for data at: {fragment_path}")

if not fragment_path.exists():
    print("\n❌ ERROR: Data not found!")
    print("\nAvailable paths in /kaggle/input/:")
    !ls -la /kaggle/input/
    print("\nPlease:")
    print("1. Add Vesuvius Challenge dataset in Kaggle notebook settings")
    print("2. Update CONFIG['data_root'] and CONFIG['fragment_id'] above")
    raise FileNotFoundError(f"Fragment not found: {fragment_path}")
else:
    print(f"✓ Data found!\n")
    
    # List contents
    print(f"Contents of {fragment_path}:")
    !ls -la {fragment_path}
    
    # Check for surface_volume
    surface_volume_path = fragment_path / "surface_volume"
    if surface_volume_path.exists():
        num_slices = len(list(surface_volume_path.glob('*.tif')))
        print(f"\n✓ Found {num_slices} CT slices")
    else:
        print("\n⚠️ Warning: surface_volume directory not found")

In [None]:
%%time
# Load volume
print("Loading 3D volume...")

loader = VolumeLoader(
    data_root=CONFIG['data_root'],
    fragment_id=CONFIG['fragment_id'],
    num_slices=CONFIG['data']['num_slices'],
    normalize=True
)

volume = loader.load_volume()
mask = loader.load_mask()
labels = loader.load_labels()

print(f"\nVolume shape: {volume.shape}")
print(f"Volume range: [{volume.min():.3f}, {volume.max():.3f}]")
print(f"Volume size: {volume.nbytes / 1024**2:.2f} MB")

if mask is not None:
    print(f"\nMask shape: {mask.shape}")
    print(f"Valid region: {mask.sum() / mask.size * 100:.2f}%")
else:
    print("\n⚠️ No mask found - using full volume")
    mask = np.ones(volume.shape[1:], dtype=np.uint8)

if labels is not None:
    print(f"\nLabels shape: {labels.shape}")
    print(f"Surface coverage: {labels.sum() / labels.size * 100:.2f}%")
else:
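    print("\n❌ No labels found! Cannot train without ground truth.")
    print("This fragment may not have labels. Try a different fragment.")
    # Stop here: the later cells would fail on labels=None anyway
    raise RuntimeError("Labels are required for training")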
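Before loading a large fragment, it helps to estimate whether the volume fits in the RAM reported by the environment check. The sketch below is plain arithmetic on the volume shape. It assumes the loader keeps the volume as float32 (4 bytes per voxel) after normalization, which is an assumption about `VolumeLoader`, and the 8000 x 6000 slice size is hypothetical.

```python
def volume_size_gb(num_slices: int, height: int, width: int,
                   bytes_per_voxel: int = 4) -> float:
    """Approximate in-memory size of a (D, H, W) volume in GiB."""
    return num_slices * height * width * bytes_per_voxel / 1024**3


# 65 slices of a hypothetical 8000 x 6000 fragment, stored as float32.
print(f"{volume_size_gb(65, 8000, 6000):.2f} GB")
# The same volume stored as uint16 (2 bytes per voxel) needs half of that.
print(f"{volume_size_gb(65, 8000, 6000, bytes_per_voxel=2):.2f} GB")
```

If the estimate is close to the available RAM, loading fewer slices or cropping the fragment is the safer option.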

## 6️⃣ Visualize Data

In [None]:
# Visualize middle slice
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Volume
axes[0].imshow(volume[32], cmap='gray')
axes[0].set_title('Volume (Slice 32)')
axes[0].axis('off')

# Mask
if mask is not None:
    axes[1].imshow(mask, cmap='gray')
    axes[1].set_title('Valid Region Mask')
    axes[1].axis('off')

# Labels
if labels is not None:
    axes[2].imshow(labels, cmap='hot')
    axes[2].set_title('Surface Labels')
    axes[2].axis('off')

plt.tight_layout()
plt.savefig(CONFIG['output_dir'] / 'data_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Visualization saved to outputs/")

## 7️⃣ Create Dataset

In [None]:
class SurfaceDetectionDataset(Dataset):
    """Dataset for surface detection training."""
    
    def __init__(self, volume_patches, label_patches, augment=True):
        self.volume_patches = volume_patches
        self.label_patches = label_patches
        self.augment = augment
        
        if augment:
            self.aug_pipeline = VolumeAugmentationPipeline(
                z_translation_prob=0.5,
                max_z_shift=5,
                image_size=volume_patches[0].shape[-1],
                use_heavy_augs=True,
                is_training=True
            )
    
    def __len__(self):
        return len(self.volume_patches)
    
    def __getitem__(self, idx):
        volume = self.volume_patches[idx].copy()
        label = self.label_patches[idx].copy()
        
        # Augmentation
        if self.augment:
            volume, label = self.aug_pipeline(volume, label)
        
        # Convert to tensors
        volume = torch.from_numpy(volume).float()  # (D, H, W)
        label = torch.from_numpy(label).float().unsqueeze(0)  # (1, H, W)
        
        return {'image': volume, 'mask': label}

print("✓ Dataset class defined")
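The pipeline above is configured with `z_translation_prob=0.5` and `max_z_shift=5`. The repository's `ZTranslationAugment` is not shown here, so the following is only a sketch of what a depth shift could look like: slices move along the depth axis and the edge slice is repeated to fill the gap. The function names are hypothetical. The 2D label is untouched because the shift only affects depth.

```python
import numpy as np


def shift_along_depth(volume: np.ndarray, shift: int) -> np.ndarray:
    """Shift a (D, H, W) volume by `shift` slices, repeating the edge slice."""
    depth = volume.shape[0]
    # Source slice for every output slice, clamped to the valid range.
    src = np.clip(np.arange(depth) - shift, 0, depth - 1)
    return volume[src]


def random_z_shift(volume: np.ndarray, max_shift: int = 5, prob: float = 0.5,
                   rng: np.random.Generator = None) -> np.ndarray:
    """Apply a random depth shift in [-max_shift, max_shift] with probability `prob`."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() >= prob:
        return volume
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return shift_along_depth(volume, shift)


patch = np.arange(5, dtype=np.float32)[:, None, None] * np.ones((5, 2, 2), dtype=np.float32)
print(shift_along_depth(patch, 2)[:, 0, 0])   # slices moved two steps deeper
print(shift_along_depth(patch, -1)[:, 0, 0])  # slices moved one step shallower
```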

In [None]:
%%time
# Extract patches
print("Extracting training patches...")

extractor = PatchExtractor(
    patch_size=CONFIG['data']['patch_size'],
    stride=CONFIG['data']['stride'],
    balanced_sampling=True,
    surface_ratio=0.5
)

vol_patches, label_patches, coords = extractor.extract_patches(
    volume, mask, labels
)

print(f"\nExtracted {len(vol_patches)} patches")
print(f"Patch shape: {vol_patches[0].shape}")

# Count surface vs non-surface
surface_patches = sum(1 for lp in label_patches if lp.sum() > 0)
print(f"\nSurface patches: {surface_patches}")
print(f"Non-surface patches: {len(label_patches) - surface_patches}")
print(f"Balance ratio: {surface_patches / len(label_patches) * 100:.1f}%")
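The extractor is created with `balanced_sampling=True` and `surface_ratio=0.5`. How `PatchExtractor` implements this is not shown in the notebook, so the sketch below is one plausible approach, not the repository's method: keep every surface patch and subsample background patches until the requested ratio is reached. The name `balance_indices` is hypothetical.

```python
import random


def balance_indices(has_surface, surface_ratio=0.5, seed=42):
    """Keep all surface patches and enough background ones to hit surface_ratio."""
    if not 0 < surface_ratio <= 1:
        raise ValueError("surface_ratio must be in (0, 1]")
    pos = [i for i, flag in enumerate(has_surface) if flag]
    neg = [i for i, flag in enumerate(has_surface) if not flag]
    # Background count such that len(pos) / total is close to surface_ratio.
    n_neg = min(len(neg), round(len(pos) * (1 - surface_ratio) / surface_ratio))
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_neg))


flags = [True] * 10 + [False] * 90
kept = balance_indices(flags, surface_ratio=0.5)
print(len(kept), sum(flags[i] for i in kept))
```

If there are fewer background patches than the ratio asks for, all of them are kept and the final ratio ends up higher than requested.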

In [None]:
# Train/val split
from sklearn.model_selection import train_test_split

train_vol, val_vol, train_labels, val_labels = train_test_split(
    vol_patches, label_patches,
    test_size=CONFIG['data']['val_split'],
    random_state=42
)

print(f"Training samples: {len(train_vol)}")
print(f"Validation samples: {len(val_vol)}")

# Create datasets
train_dataset = SurfaceDetectionDataset(train_vol, train_labels, augment=True)
val_dataset = SurfaceDetectionDataset(val_vol, val_labels, augment=False)

# Create dataloaders
train_loader = DataLoader(
    train_dataset,
    batch_size=CONFIG['training']['batch_size'],
    shuffle=True,
    num_workers=CONFIG['training']['num_workers'],
    pin_memory=True
)

val_loader = DataLoader(
    val_dataset,
    batch_size=CONFIG['training']['batch_size'] * 2,
    shuffle=False,
    num_workers=CONFIG['training']['num_workers'],
    pin_memory=True
)

print(f"\n✓ Dataloaders ready!")
print(f"   Train batches: {len(train_loader)}")
print(f"   Val batches: {len(val_loader)}")
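Because the stride is half the patch size, neighbouring patches share pixels, and a random split can place overlapping patches in both the training and validation sets. One alternative is to split by position using the `coords` returned by the extractor. The sketch below assumes each coordinate is a `(y, x)` tuple of the patch's top-left corner, which is an assumption about `PatchExtractor`; `spatial_split` is a hypothetical helper.

```python
def spatial_split(coords, patch_size, val_fraction=0.2):
    """Split patch indices by x-position; patches crossing the boundary are dropped."""
    max_x = max(x for _, x in coords) + patch_size
    boundary = max_x * (1 - val_fraction)
    train_idx, val_idx = [], []
    for i, (_, x) in enumerate(coords):
        if x + patch_size <= boundary:
            train_idx.append(i)   # entirely left of the boundary
        elif x >= boundary:
            val_idx.append(i)     # entirely right of the boundary
    return train_idx, val_idx


# One row of 128-pixel patches with stride 64 across a 1024-pixel-wide fragment.
coords = [(0, x) for x in range(0, 1024 - 128 + 1, 64)]
train_idx, val_idx = spatial_split(coords, patch_size=128)
print(len(train_idx), len(val_idx))
```

The trade-off is that patches straddling the boundary are discarded, so the validation set can end up smaller than `val_fraction` suggests.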

## 8️⃣ Initialize Model

In [None]:
# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

# Model
model = UNet3DDepthInvariant(
    in_channels=CONFIG['model']['in_channels'],
    out_channels=CONFIG['model']['out_channels'],
    base_features=CONFIG['model']['base_features'],
    depth=CONFIG['model']['depth']
)

print(f"\nModel: UNet3DDepthInvariant")
print(f"Parameters: {count_parameters(model):,}")
print(f"Base features: {CONFIG['model']['base_features']}")
print(f"Depth: {CONFIG['model']['depth']}")

# Loss function
loss_fn = CombinedLoss(
    bce_weight=0.5,
    dice_weight=0.5,
    label_smoothing=0.1
)

# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=CONFIG['training']['learning_rate'],
    weight_decay=1e-5
)

# Scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=CONFIG['training']['num_epochs'],
    eta_min=1e-6
)

print("\n✓ Model initialized!")
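The loss above mixes binary cross-entropy and Dice with equal weights and a label smoothing of 0.1. The repository's `CombinedLoss` is not shown, so the NumPy sketch below only illustrates the idea. The smoothing rule (pulling hard labels toward 0.5) and the choice to compute Dice on unsmoothed targets are assumptions, and `combined_loss` is a hypothetical name.

```python
import numpy as np


def combined_loss(logits, targets, bce_weight=0.5, dice_weight=0.5,
                  label_smoothing=0.1, eps=1e-6):
    """Weighted sum of BCE-with-logits (smoothed labels) and soft Dice loss."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # Pull hard 0/1 labels toward 0.5 (assumed smoothing convention).
    smooth = targets * (1.0 - label_smoothing) + 0.5 * label_smoothing
    # Numerically stable BCE with logits: max(x, 0) - x*y + log(1 + exp(-|x|)).
    bce = np.mean(np.maximum(logits, 0) - logits * smooth
                  + np.log1p(np.exp(-np.abs(logits))))
    probs = 1.0 / (1.0 + np.exp(-logits))
    intersection = np.sum(probs * targets)
    dice = (2.0 * intersection + eps) / (np.sum(probs) + np.sum(targets) + eps)
    return float(bce_weight * bce + dice_weight * (1.0 - dice))


target = np.array([[1, 0], [0, 1]])
good = np.array([[5.0, -5.0], [-5.0, 5.0]])
print(combined_loss(good, target))   # confident and correct
print(combined_loss(-good, target))  # confident and wrong, so a larger loss
```

The Dice term helps when surface pixels are rare, since plain BCE can be dominated by the background class.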

## 9️⃣ Training Loop

In [None]:
# Initialize trainer
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    device=device,
    scheduler=scheduler,
    use_amp=CONFIG['training']['use_amp'],
    checkpoint_dir=CONFIG['checkpoint_dir'],
    use_wandb=False
)

print("✓ Trainer initialized")
print(f"   Mixed precision: {CONFIG['training']['use_amp']}")
print(f"   Checkpoint dir: {CONFIG['checkpoint_dir']}")

In [None]:
%%time
# TRAINING - This will take 3-4 hours on Kaggle GPU
print("="*60)
print("STARTING TRAINING")
print("="*60)
print(f"Epochs: {CONFIG['training']['num_epochs']}")
print(f"This will take approximately {CONFIG['training']['num_epochs'] * 4:.0f} minutes")
print("\n💡 TIP: After clicking 'Save & Run All', you can close your browser!")
print("   Kaggle will continue training in the cloud.")
print("   Come back later to check results.")
print("="*60)
print()

start_time = time.time()

# Train!
trainer.fit(
    train_loader=train_loader,
    val_loader=val_loader,
    num_epochs=CONFIG['training']['num_epochs'],
    save_every=CONFIG['training']['save_every']
)

elapsed = time.time() - start_time
print(f"\n" + "="*60)
print(f"TRAINING COMPLETE!")
print(f"Total time: {elapsed/3600:.2f} hours")
print(f"Best validation loss: {trainer.best_val_loss:.4f}")
print("="*60)
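The estimate printed above uses roughly 4 minutes per epoch. The same arithmetic can check whether a longer run still fits in the 9-hour session mentioned in the introduction. This is a rough sketch: the 15-minute overhead for setup and data loading is a guess, and the helper names are hypothetical.

```python
def estimated_hours(num_epochs, minutes_per_epoch, overhead_minutes=15.0):
    """Total run time in hours, including a fixed setup overhead."""
    return (num_epochs * minutes_per_epoch + overhead_minutes) / 60


def max_epochs(minutes_per_epoch, session_limit_hours=9.0, overhead_minutes=15.0):
    """Largest epoch count that still fits inside the session limit."""
    return int((session_limit_hours * 60 - overhead_minutes) // minutes_per_epoch)


print(f"{estimated_hours(50, 4):.2f} hours for 50 epochs")
print(max_epochs(4))  # 131
```

Measuring the first epoch and plugging the real duration into `max_epochs` gives a better bound than the 4-minute figure.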

## 🔟 Save Results

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Loss
axes[0].plot(trainer.train_losses, label='Train Loss')
axes[0].plot(trainer.val_losses, label='Val Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training & Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

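# Learning rate: rebuild the cosine schedule from its closed form.
# The optimizer only exposes the current LR, so repeating it draws a flat line.
# Assumes the Trainer steps the scheduler once per epoch.
base_lr = CONFIG['training']['learning_rate']
t_max = CONFIG['training']['num_epochs']
eta_min = 1e-6  # must match the scheduler's eta_min
lrs = [eta_min + (base_lr - eta_min) * (1 + np.cos(np.pi * epoch / t_max)) / 2
       for epoch in range(len(trainer.train_losses))]
axes[1].plot(lrs)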
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Learning Rate')
axes[1].set_title('Learning Rate Schedule')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(CONFIG['output_dir'] / 'training_curves.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Training curves saved")

In [None]:
# Save final model separately
final_model_path = CONFIG['output_dir'] / 'final_model.pth'
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'config': CONFIG,
    'best_val_loss': trainer.best_val_loss,
}, final_model_path)

print(f"✓ Final model saved to {final_model_path}")
print(f"   File size: {final_model_path.stat().st_size / 1024**2:.2f} MB")

## 1️⃣1️⃣ Inference Example

In [None]:
# Load best model
best_checkpoint = CONFIG['checkpoint_dir'] / 'best_model.pt'
# map_location keeps the load working when the saving and loading devices differ
checkpoint = torch.load(best_checkpoint, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

print(f"✓ Loaded best model (val_loss: {checkpoint['best_val_loss']:.4f})")

# Predict on validation sample
sample_idx = 0
sample = val_dataset[sample_idx]

with torch.no_grad():
    image = sample['image'].unsqueeze(0).to(device)
    pred_logits = model(image)
    pred_probs = torch.sigmoid(pred_logits)
    pred_binary = (pred_probs > 0.96).float()

# Visualize
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

axes[0].imshow(sample['image'][32].cpu(), cmap='gray')
axes[0].set_title('Input (Slice 32)')
axes[0].axis('off')

axes[1].imshow(sample['mask'][0].cpu(), cmap='hot')
axes[1].set_title('Ground Truth')
axes[1].axis('off')

axes[2].imshow(pred_probs[0, 0].cpu(), cmap='hot', vmin=0, vmax=1)
axes[2].set_title('Prediction (Probability)')
axes[2].axis('off')

axes[3].imshow(pred_binary[0, 0].cpu(), cmap='gray')
axes[3].set_title('Prediction (Binary, threshold=0.96)')
axes[3].axis('off')

plt.tight_layout()
plt.savefig(CONFIG['output_dir'] / 'prediction_example.png', dpi=150, bbox_inches='tight')
plt.show()

# Compute metrics
dice = compute_dice(pred_logits, sample['mask'].unsqueeze(0).to(device))
iou = compute_iou(pred_logits, sample['mask'].unsqueeze(0).to(device))

print(f"\nMetrics on sample:")
print(f"  Dice: {dice:.4f}")
print(f"  IoU: {iou:.4f}")
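For reference, Dice and IoU on binary masks can be computed directly from the overlap counts. The repository's `compute_dice` and `compute_iou` take logits and are not shown here, so the NumPy sketch below is an independent illustration on already-thresholded masks; `dice_and_iou` is a hypothetical name.

```python
import numpy as np


def dice_and_iou(pred_binary, target, eps=1e-7):
    """Dice and IoU for two binary masks of the same shape."""
    pred = np.asarray(pred_binary, dtype=bool)
    tgt = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, tgt).sum()
    total = pred.sum() + tgt.sum()
    union = total - intersection
    dice = (2 * intersection + eps) / (total + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)


pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [0, 0]])
dice, iou = dice_and_iou(pred, truth)
print(f"Dice: {dice:.4f}, IoU: {iou:.4f}")
```

The two metrics are linked by IoU = Dice / (2 - Dice), so they always rank predictions in the same order, with IoU being the stricter number.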

## 🎉 Done!

### 📦 Outputs (saved to `/kaggle/working/`):
- `checkpoints/best_model.pt` - Best model checkpoint
- `checkpoints/checkpoint_epoch_*.pt` - Intermediate checkpoints
- `outputs/final_model.pth` - Final model weights
- `outputs/training_curves.png` - Training visualization
- `outputs/prediction_example.png` - Sample prediction

### 📥 Download Results:
When notebook finishes:
1. Click "Output" tab (top right)
2. Download all files

### 🔄 Next Steps:
1. Try different fragments
2. Increase epochs for better results
3. Ensemble multiple models
4. Apply to full scrolls
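
For the ensembling step, a simple starting point is to average the probability maps of several models before thresholding. The sketch below assumes each model has already produced a probability map of the same shape; `ensemble_probabilities` is a hypothetical helper, and the 0.96 threshold is the one used in the inference cell.

```python
import numpy as np


def ensemble_probabilities(prob_maps, weights=None):
    """Average per-model probability maps, optionally with per-model weights."""
    stacked = np.stack([np.asarray(p, dtype=np.float64) for p in prob_maps])
    return np.average(stacked, axis=0, weights=weights)


model_a = np.array([[0.99, 0.20]])
model_b = np.array([[0.97, 0.40]])
mean_probs = ensemble_probabilities([model_a, model_b])
binary = (mean_probs > 0.96).astype(np.float32)
print(mean_probs, binary)
```

Weights can reflect each model's validation score, for example `weights=[3, 1]` to favour the stronger model.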

### 💡 Tips:
- This notebook can run with **browser closed** if you use "Save & Run All"
- Kaggle will continue training for up to 9 hours
- Check back later to see results
- Don't forget to save outputs before session expires!