# Piano Performance Evaluation - Pseudo-Label Pre-training

This notebook trains a baseline model on MAESTRO pseudo-labels.

**Goal**: Production-ready baseline model to compare against future expert labels

**Requirements:**
- Colab Pro (recommended for T4/V100 GPU)
- Google Drive for data and checkpoints
- HuggingFace account for MERT model
- Git repository pushed to GitHub

**Note**: This notebook installs packages directly into Colab's system Python (no venv needed)

---

## Google Drive Setup

Upload the MAESTRO dataset to your Google Drive in this structure:

```
MyDrive/
  piano_eval_data/
    maestro/                           # MAESTRO v3.0.0 dataset
      maestro-v3.0.0.csv               # Metadata file (required!)
      2004/
        audio/
        midi/
      2006/
        audio/
        midi/
      ...
  piano_eval_checkpoints/              # Empty folder (checkpoints will be saved here)
    pseudo_pretrain/
```

**Note**: Annotation files (JSONL) will be generated automatically in this notebook with correct paths. You only need the raw MAESTRO dataset.
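
The layout above can be sanity-checked before running the setup cells. The following is a minimal sketch and not part of the repository: the helper name `check_maestro_layout` is hypothetical, and it only looks for the metadata CSV and the per-year `audio/` and `midi/` folders shown in the tree.

```python
from pathlib import Path


def check_maestro_layout(data_dir):
    """Return a list of problems found in the expected layout (empty if OK)."""
    problems = []
    maestro = Path(data_dir) / "maestro"
    if not maestro.is_dir():
        return [f"missing folder: {maestro}"]
    if not (maestro / "maestro-v3.0.0.csv").is_file():
        problems.append("missing metadata file: maestro-v3.0.0.csv")
    # Year folders are the directories with all-digit names (2004, 2006, ...).
    years = sorted(p for p in maestro.iterdir() if p.is_dir() and p.name.isdigit())
    if not years:
        problems.append("no year folders found")
    for year in years:
        for sub in ("audio", "midi"):
            if not (year / sub).is_dir():
                problems.append(f"missing folder: {year.name}/{sub}")
    return problems
```

After mounting Drive, `check_maestro_layout('/content/drive/MyDrive/piano_eval_data')` should return an empty list if the upload matches the structure above.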

## 1. Setup Environment

In [None]:
# Login to HF
import os
os.environ.pop("HF_TOKEN", None)
os.environ.pop("HUGGINGFACEHUB_API_TOKEN", None)
from huggingface_hub import login, HfApi
try:
    import getpass as gp
    raw = gp.getpass("Paste your Hugging Face token (input hidden): ")
    token = raw.decode() if isinstance(raw, (bytes, bytearray)) else raw
    if not isinstance(token, str):
        raise TypeError(f"Unexpected token type: {type(token).__name__}")
    token = token.strip()
    if not token:
        raise ValueError("Empty token provided")
    login(token=token, add_to_git_credential=False)
    who = HfApi().whoami(token=token)
    print(f"Logged in as: {who.get('name') or who.get('email') or 'OK'}")
except Exception as e:
    print(f"[HF Login] getpass flow failed: {e}")
    print("Falling back to interactive login widget...")
    login()
    try:
        who = HfApi().whoami()
        print(f"Logged in as: {who.get('name') or who.get('email') or 'OK'}")
    except Exception as e2:
        print(f"[HF Login] Verification skipped: {e2}")

In [None]:
# Mount Google Drive for data and checkpoints
from google.colab import drive
drive.mount('/content/drive')

# Verify data exists
import os
data_dir = '/content/drive/MyDrive/piano_eval_data'
checkpoint_dir = '/content/drive/MyDrive/piano_eval_checkpoints'

assert os.path.exists(data_dir), f"Data directory not found: {data_dir}"
assert os.path.exists(checkpoint_dir), f"Checkpoint directory not found: {checkpoint_dir}"

print(f"✓ Data directory: {data_dir}")
print(f"  Contents: {os.listdir(data_dir)}")
print(f"✓ Checkpoint directory: {checkpoint_dir}")

In [None]:
REPO_URL = "https://github.com/Jai-Dhiman/crescendai.git"
BRANCH = "main"

# Remove old clone if exists
!rm -rf /content/crescendai

# Clone fresh
!git clone --branch {BRANCH} {REPO_URL} /content/crescendai

# Navigate to model directory
%cd /content/crescendai/model

# Show git status
!git log -1 --oneline
!git status

In [None]:
# Install uv (fast Python package manager)
!curl -LsSf https://astral.sh/uv/install.sh | sh

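# Add to PATH for this session
# (newer uv installers place the binary in ~/.local/bin, older ones in ~/.cargo/bin)
import os
home = os.environ['HOME']
os.environ['PATH'] = f"{home}/.local/bin:{home}/.cargo/bin:{os.environ['PATH']}"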

In [None]:
# Install dependencies directly into system Python (no venv needed in Colab)
# Using uv pip for faster installation
!uv pip install --system -e .

# Verify installation
import os
os.environ['MPLBACKEND'] = 'Agg'

import torch
import pytorch_lightning

print(f"Dependencies installed")
print(f"PyTorch: {torch.__version__}")
print(f"Lightning: {pytorch_lightning.__version__}")

## 2. Verify Setup

In [None]:
# Check GPU and PyTorch setup
import torch
import pytorch_lightning as pl

print(f"PyTorch version: {torch.__version__}")
print(f"Lightning version: {pl.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"\n✓ GPU ready for training")
else:
    print("\n" + "="*70)
    print("⚠️  CRITICAL: NO GPU DETECTED!")
    print("="*70)
    print("\nTraining on CPU will be EXTREMELY SLOW (200x slower than GPU).")
    print("\nTO ENABLE GPU:")
    print("1. Go to: Runtime → Change runtime type")
    print("2. Set 'Hardware accelerator' to: T4 GPU (or V100)")
    print("3. Click 'Save'")
    print("4. Re-run all cells from the beginning")
    print("\nDO NOT proceed with training until GPU is enabled!")
    print("="*70)
    raise RuntimeError("GPU required for training. Please enable GPU and restart.")

In [None]:
# Test MERT model download (this will cache the model)
import torch
from transformers import AutoModel

print("Downloading MERT-95M model (one-time, ~380MB)...")
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
print(f"✓ MERT-95M loaded: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
del model  # Free memory
torch.cuda.empty_cache()

In [None]:
# Verify MAESTRO dataset exists in Google Drive
import os

maestro_root = f'{data_dir}/maestro'

# Check MAESTRO dataset
assert os.path.exists(maestro_root), f"MAESTRO dataset not found at: {maestro_root}"
assert os.path.exists(f'{maestro_root}/maestro-v3.0.0.csv'), "maestro-v3.0.0.csv not found! Make sure you uploaded the complete MAESTRO dataset."

print(f"✓ MAESTRO dataset found at: {maestro_root}")
print(f"  Contents: {os.listdir(maestro_root)[:10]}")  # Show first 10 items

## Generate Pseudo-Label Annotations

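This will create annotation files with correct Google Drive paths. The cell below processes all pieces, which takes roughly 2-3 hours. If the train and val annotation files already exist in Google Drive, generation is skipped.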
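
The generation cell below passes `--segment-duration 20.0` and `--segment-overlap 5.0`, which implies a 15-second hop between segment starts. The sketch below shows how such window boundaries could be computed. It assumes a plain sliding window that drops a trailing partial segment; `generate_pseudo_labels.py` may handle the tail differently, and the function name `segment_bounds` is hypothetical.

```python
def segment_bounds(total_seconds, segment_duration=20.0, segment_overlap=5.0):
    """Return (start, end) times in seconds for a sliding window over a recording."""
    if segment_overlap >= segment_duration:
        raise ValueError("overlap must be smaller than the segment duration")
    hop = segment_duration - segment_overlap
    bounds = []
    start = 0.0
    # Keep only full-length windows; a trailing partial window is dropped here.
    while start + segment_duration <= total_seconds:
        bounds.append((start, start + segment_duration))
        start += hop
    return bounds


print(segment_bounds(60.0))
# → [(0.0, 20.0), (15.0, 35.0), (30.0, 50.0)]
```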

In [None]:
import os

# Output path for combined annotations (temporary)
temp_annotations = '/tmp/maestro_pseudo_labels_all.jsonl'

# Check if annotations already exist (skip regeneration if so)
train_path = f'{data_dir}/maestro_pseudo_labels_train.jsonl'
val_path = f'{data_dir}/maestro_pseudo_labels_val.jsonl'

if os.path.exists(train_path) and os.path.exists(val_path):
    print("✓ Annotation files already exist, skipping generation.")
    print(f"  Train: {train_path}")
    print(f"  Val: {val_path}")
    print("\nIf you want to regenerate, delete these files from Google Drive and re-run this cell.")
else:
    print("Generating pseudo-labels for MAESTRO dataset...")
    print("Processing ALL 238 pieces (will take ~2-3 hours).\n")
    
    # Generate pseudo-labels for ALL pieces (no limit)
    !python scripts/generate_pseudo_labels.py \
      --maestro-root {maestro_root} \
      --output {temp_annotations} \
      --metadata-csv {maestro_root}/maestro-v3.0.0.csv \
      --segment-duration 20.0 \
      --segment-overlap 5.0
    
    print("\n✓ Pseudo-label generation complete!")
    
    # Split annotations into train/val sets (80/20 split)
    print("\nSplitting annotations into train/val sets...")
    !python scripts/split_annotations.py \
      --input {temp_annotations} \
      --train-output {train_path} \
      --val-output {val_path} \
      --train-ratio 0.8 \
      --val-ratio 0.2 \
      --seed 42
    
    print("\n✓ Train/val split complete!")
    print(f"  Train: {train_path}")
    print(f"  Val: {val_path}")
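
For reference, a seeded 80/20 split like the one requested above can be sketched as follows. This is an illustration only: it assumes a simple record-level shuffle, which may not be what `split_annotations.py` does, and the function name `split_records` is hypothetical.

```python
import random


def split_records(records, train_ratio=0.8, seed=42):
    """Shuffle a copy of the records with a fixed seed and cut at train_ratio."""
    shuffled = list(records)
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, val = split_records(range(10))
print(len(train), len(val))
# → 8 2
```

One caveat with a record-level split: because segments overlap by 5 seconds, neighbouring segments from the same recording can land on both sides of the split. Grouping records by recording before shuffling would avoid that leakage.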

## 3. Prepare Training Configuration

In [None]:
# Create training configuration for Colab
import yaml
import os

# Define config with Google Drive paths
config = {
    'data': {
        'train_path': f'{data_dir}/maestro_pseudo_labels_train.jsonl',
        'val_path': f'{data_dir}/maestro_pseudo_labels_val.jsonl',
        'test_path': None,
        'dimensions': [
            'note_accuracy',
            'rhythmic_precision',
            'dynamics_control',
            'articulation',
            'pedaling',
            'tone_quality'
        ],
        'audio_sample_rate': 24000,
        'max_audio_length': 240000,
        'max_midi_events': 512,
        'batch_size': 8,
        'num_workers': 2,
        'pin_memory': True,
        'augmentation': {
            'enabled': True,
            'pitch_shift': {
                'enabled': True,
                'probability': 0.3,
                'min_semitones': -2,
                'max_semitones': 2
            },
            'time_stretch': {
                'enabled': True,
                'probability': 0.3,
                'min_rate': 0.85,
                'max_rate': 1.15
            },
            'add_noise': {
                'enabled': True,
                'probability': 0.2,
                'min_snr_db': 25,
                'max_snr_db': 40
            },
            'room_acoustics': {
                'enabled': True,
                'probability': 0.2,
                'num_room_types': 5
            },
            'compress_audio': {
                'enabled': True,
                'probability': 0.15,
                'bitrates': [128, 192, 256, 320]
            },
            'gain_variation': {
                'enabled': True,
                'probability': 0.3,
                'min_db': -6,
                'max_db': 6
            },
            'max_transforms': 3
        }
    },
    'model': {
        'audio_dim': 768,
        'midi_dim': 256,
        'fusion_dim': 1024,
        'aggregator_dim': 512,
        'num_dimensions': 6,
        'mert_model_name': 'm-a-p/MERT-v1-95M',
        'freeze_audio_encoder': False,
        'gradient_checkpointing': True,
        'midi_hidden_size': 256,
        'midi_num_layers': 6,
        'midi_num_heads': 4,
        'fusion_num_heads': 8,
        'fusion_dropout': 0.1,
        'lstm_hidden': 256,
        'lstm_layers': 2,
        'attention_heads': 4,
        'aggregator_dropout': 0.2,
        'shared_hidden': 256,
        'task_hidden': 128,
        'mtl_dropout': 0.1
    },
    'training': {
        'max_epochs': 20,
        'precision': 16,
        'optimizer': 'AdamW',
        'learning_rate': 1e-5,
        'backbone_lr': 1e-5,
        'heads_lr': 1e-4,
        'weight_decay': 0.01,
        'scheduler': 'cosine',
        'warmup_steps': 500,
        'min_lr': 1e-6,
        'gradient_clip_val': 1.0,
        'accumulate_grad_batches': 4,
        'val_check_interval': 1.0,
        'limit_val_batches': 1.0
    },
    'callbacks': {
        'checkpoint': {
            'monitor': 'val_loss',
            'mode': 'min',
            'save_top_k': 3,
            'save_last': True,
            'dirpath': f'{checkpoint_dir}/pseudo_pretrain',
            'filename': 'pseudo-{epoch:02d}-{val_loss:.4f}'
        },
        'early_stopping': {
            'monitor': 'val_loss',
            'mode': 'min',
            'patience': 5,
            'min_delta': 0.001
        },
        'lr_monitor': {
            'logging_interval': 'step'
        }
    },
    'logging': {
        'log_every_n_steps': 50,
        'use_wandb': False,
        'wandb_project': 'piano-eval-mvp',
        'wandb_entity': None,
        'wandb_run_name': 'pseudo-pretrain',
        'use_tensorboard': True,
        'tensorboard_logdir': f'{checkpoint_dir}/logs'
    },
    'seed': 42
}

# Save config to temporary file
colab_config_path = '/tmp/colab_config.yaml'
with open(colab_config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)

print(f"✓ Training configuration created: {colab_config_path}")
print("\nConfiguration summary:")
print(f"  Batch size: {config['data']['batch_size']}")
print(f"  Gradient accumulation: {config['training']['accumulate_grad_batches']}")
print(f"  Effective batch size: {config['data']['batch_size'] * config['training']['accumulate_grad_batches']}")
print(f"  Max epochs: {config['training']['max_epochs']}")
print(f"  Checkpoint dir: {config['callbacks']['checkpoint']['dirpath']}")
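
The `scheduler`, `warmup_steps`, `learning_rate`, and `min_lr` settings above describe a warmup followed by cosine decay. How `train.py` implements this is not shown in this notebook, so the sketch below is one common formulation (linear warmup, then cosine decay to `min_lr`) under that assumption. The function name `lr_at_step` and the `total_steps` argument are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, min_lr=1e-6, warmup_steps=500):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points of a hypothetical 2000-step run.
for s in (0, 499, 1250, 2000):
    print(s, lr_at_step(s, total_steps=2000))
```

With gradient accumulation, the step counter would normally count optimizer steps rather than batches, so 500 warmup steps correspond to 2,000 batches at `accumulate_grad_batches = 4`.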

## 4. Train Model

In [None]:
# Run MIDI diagnostics on first 5 files to identify the issue
!python scripts/diagnose_midi.py {data_dir}/maestro_pseudo_labels_train.jsonl --num 5

## MIDI Diagnostics (Run if MIDI loading fails)

If you see many MIDI loading failures, run the diagnostic cell above to identify the root cause. The cell below pulls the latest code fixes:

In [None]:
# Pull latest code fixes (augmentation config parsing, MIDI loading, diagnostics)
%cd /content/crescendai
!git pull origin main
%cd /content/crescendai/model

# Show latest changes
!git log -3 --oneline

## Diagnostic Tools (Run if training issues occur)

These tools help diagnose training problems before running full training:
- **Full diagnostics**: Checks data, model outputs, gradients, and parameter updates
- **Single-dimension training**: Tests basic learning on just one dimension (5 epochs, ~30-60 min)

In [None]:
# Run full diagnostics to check for issues
# This analyzes data, model outputs, gradients, and parameter updates WITHOUT running full training
print("Running comprehensive diagnostics...")
print("This will take 2-3 minutes.\n")

!python scripts/diagnose_training.py --config {colab_config_path}

print("\n" + "="*80)
print("Review the diagnostics above. Look for:")
print("  ✓ Label distributions look reasonable (mean ~50-70, std > 5)")
print("  ✓ Predictions in [0, 100] range")
print("  ✓ Gradients not too small (> 1e-6) or too large (< 100)")
print("  ✓ Parameters are updating (changes > 1e-10)")
print("\nIf you see warnings, check DIAGNOSTICS.md for solutions")
print("="*80)

### Single-Dimension Diagnostic Training (Optional)

If diagnostics show issues or you want to test basic learning, run this simplified training:
- Only trains on **note_accuracy** dimension (1 of 6)
- Only 5 epochs (~30-60 min on T4)
- Higher learning rates for faster convergence
- No augmentation to isolate issues

**Success criteria**:
- Loss decreases over epochs
- MAE < 15 by epoch 5
- Pearson r > 0.3 by epoch 5
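
The two numeric criteria above can be checked by hand from a list of predictions and labels. The sketch below uses only the standard library. The helper names are hypothetical, and the training code may compute these metrics differently (for example per batch or per dimension).

```python
import math


def mae(preds, targets):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def pearson_r(preds, targets):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    if var_p == 0 or var_t == 0:
        return float("nan")
    return cov / math.sqrt(var_p * var_t)


preds = [60.0, 70.0, 80.0]
targets = [50.0, 70.0, 90.0]
print(mae(preds, targets) < 15, pearson_r(preds, targets) > 0.3)
# → True True
```

A constant prediction can still reach a low MAE when labels cluster tightly, which is why the correlation check is worth reading alongside it.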

In [None]:
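# Create diagnostic config (single dimension, 5 epochs)
# Deep copy: a shallow config.copy() shares the nested dicts, so the edits
# below would also modify the main `config`
import copy
diagnostic_config = copy.deepcopy(config)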
diagnostic_config['data']['dimensions'] = ['note_accuracy']  # Only one dimension
diagnostic_config['data']['augmentation'] = None  # Disable augmentation
diagnostic_config['model']['num_dimensions'] = 1
diagnostic_config['training']['max_epochs'] = 5
diagnostic_config['training']['backbone_lr'] = 2e-5  # Slightly higher for faster learning
diagnostic_config['training']['heads_lr'] = 2e-4
diagnostic_config['training']['warmup_steps'] = 100  # Shorter warmup
diagnostic_config['training']['accumulate_grad_batches'] = 2
diagnostic_config['callbacks']['checkpoint']['dirpath'] = f'{checkpoint_dir}/diagnostic_pretrain'
diagnostic_config['callbacks']['early_stopping']['patience'] = 3
diagnostic_config['logging']['tensorboard_logdir'] = f'{checkpoint_dir}/logs/diagnostic'

# Save diagnostic config
diagnostic_config_path = '/tmp/diagnostic_config.yaml'
with open(diagnostic_config_path, 'w') as f:
    yaml.dump(diagnostic_config, f, default_flow_style=False, sort_keys=False)

print("✓ Diagnostic training configuration created")
print("\nDiagnostic training settings:")
print(f"  Dimensions: {diagnostic_config['data']['dimensions']}")
print(f"  Epochs: {diagnostic_config['training']['max_epochs']}")
print(f"  Augmentation: Disabled")
print(f"  Backbone LR: {diagnostic_config['training']['backbone_lr']}")
print(f"  Heads LR: {diagnostic_config['training']['heads_lr']}")
print("\nRun diagnostic training (uncomment to execute):")
print(f"# !python train.py --config {diagnostic_config_path}")

# Uncomment the line below to run diagnostic training
# !python train.py --config {diagnostic_config_path}

**Note on MIDI Loading Warnings**

You may see warnings like "Failed to load MIDI for ... all the input arrays must have same number of dimensions". This is expected for a small number of malformed MIDI files in the MAESTRO dataset.

The system handles this gracefully:
- Failed MIDI files are skipped (audio-only processing)
- Fusion layer falls back to audio-only mode with zero-padded MIDI features
- Training continues normally without interruption

Typical failure rate: ~2-5 files out of 1,868 segments.
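
The fallback described above can be pictured as a wrapper around the MIDI loader. This is a simplified sketch rather than the repository's implementation: `midi_features_or_zeros` and `load_fn` are hypothetical names, plain nested lists stand in for tensors, and the 512 × 256 shape is taken from `max_midi_events` and `midi_dim` in the config on the assumption that those are the relevant dimensions.

```python
def midi_features_or_zeros(load_fn, midi_path, max_events=512, feature_dim=256):
    """Return (features, has_midi); on any load error, return an all-zero matrix."""
    try:
        return load_fn(midi_path), True
    except Exception as exc:  # malformed file, mismatched array shapes, etc.
        print(f"Failed to load MIDI for {midi_path}: {exc}")
        zeros = [[0.0] * feature_dim for _ in range(max_events)]
        return zeros, False


def broken_loader(path):
    # Stand-in for a loader that fails on a malformed file.
    raise ValueError("all the input arrays must have same number of dimensions")


features, has_midi = midi_features_or_zeros(broken_loader, "example.midi")
print(has_midi, len(features), len(features[0]))
# → False 512 256
```

Returning a flag alongside the features lets the caller count failures and confirm the rate stays small.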

In [None]:
# Full Training
!python train.py --config {colab_config_path}

## 5. Evaluate Trained Model

In [None]:
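# Load best checkpoint
import os
import re
import sys
sys.path.insert(0, '/content/crescendai/model')

from src.models.lightning_module import PerformanceEvaluationModel

# Find best checkpoint: parse val_loss from each filename and take the minimum.
# (Sorting by name would order by epoch, not by val_loss.)
checkpoint_path = f'{checkpoint_dir}/pseudo_pretrain'
checkpoints = [f for f in os.listdir(checkpoint_path) if f.endswith('.ckpt') and not f.startswith('last')]
assert checkpoints, f"No checkpoints found in: {checkpoint_path}"

def _val_loss(name):
    # The filename template ends with the val_loss value (optionally followed by -vN)
    match = re.search(r'(\d+\.\d+)(?:-v\d+)?\.ckpt$', name)
    return float(match.group(1)) if match else float('inf')

best_ckpt = min(checkpoints, key=_val_loss)
best_ckpt_path = os.path.join(checkpoint_path, best_ckpt)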

print(f"Loading best checkpoint: {best_ckpt}")
model = PerformanceEvaluationModel.load_from_checkpoint(best_ckpt_path)
model.eval()
model = model.cuda()

print(f"\nModel loaded successfully")
print(f"  Dimensions: {model.dimension_names}")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")

---

## Troubleshooting

### Session Disconnected
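- Re-run sections 1-3 (login, mount Drive, clone repo, install dependencies, recreate the config); `/tmp` is cleared when the session ends, so the config file must be written again
- Re-run the full training cell in section 4 - will automatically resume from last checkpoint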
- All checkpoints are in Google Drive (persistent)
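
To confirm that a checkpoint is available to resume from after a disconnect, the checkpoint folder can be inspected directly. The sketch below only locates a file and does not show how `train.py` resumes. The helper name `find_resume_checkpoint` is hypothetical, and it relies on Lightning writing `last.ckpt` when `save_last` is enabled, as it is in the config.

```python
from pathlib import Path


def find_resume_checkpoint(checkpoint_dir):
    """Prefer last.ckpt; otherwise fall back to the most recently modified .ckpt."""
    ckpt_dir = Path(checkpoint_dir)
    if not ckpt_dir.is_dir():
        return None
    last = ckpt_dir / "last.ckpt"
    if last.is_file():
        return last
    candidates = sorted(ckpt_dir.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None
```

For this notebook the call would be `find_resume_checkpoint(f'{checkpoint_dir}/pseudo_pretrain')`. A return value of `None` means training would start from scratch.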

### Out of Memory (OOM)
- Reduce batch size in config: `config['data']['batch_size'] = 4`
- Increase gradient accumulation: `config['training']['accumulate_grad_batches'] = 8`
- This keeps effective batch size = 4 × 8 = 32
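
The same arithmetic applies to any batch size that divides 32. A small sketch, with a hypothetical helper name:

```python
def accumulation_for(target_effective_batch, batch_size):
    """Gradient-accumulation steps needed to reach the target effective batch size."""
    if batch_size <= 0 or target_effective_batch % batch_size != 0:
        raise ValueError("batch_size must be positive and divide the target")
    return target_effective_batch // batch_size


for bs in (8, 4, 2):
    print(bs, accumulation_for(32, bs))
# → 8 4
# → 4 8
# → 2 16
```

Smaller batches with more accumulation steps keep the optimizer's effective batch size unchanged, at the cost of more forward and backward passes per optimizer step.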

### Slow Training
- Check you have T4 or better GPU (not K80)
- Verify data is in Google Drive (not Colab Files)
- Check num_workers: `config['data']['num_workers'] = 2` (lower if I/O bottleneck)

### MERT Download Fails
- Verify HuggingFace authentication
- Check internet connection
- Try manual download: `huggingface-cli download m-a-p/MERT-v1-95M`
