# Piano Performance Evaluation - Pseudo-Label Pre-training

This notebook trains a baseline model on MAESTRO pseudo-labels.

**Goal**: Train a production-ready baseline model for comparison once expert labels are available

**Requirements:**
- Colab Pro (recommended for T4/V100 GPU)
- Google Drive for data and checkpoints
- HuggingFace account for MERT model
- Git repository pushed to GitHub

**Note**: This notebook installs packages directly into Colab's system Python (no venv needed)

---

## Google Drive Setup

```
MyDrive/
  piano_eval_data/
    maestro_pseudo_labels_train.jsonl  # Pseudo-label annotations
    maestro_pseudo_labels_val.jsonl
    maestro/                           # (Optional) Audio/MIDI files if not using URLs
      2004/
        audio/
        midi/
      2006/
        ...
  piano_eval_checkpoints/              # Empty folder (checkpoints will be saved here)
    pseudo_pretrain/
```
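
Before mounting anything in training code, it can help to check the layout above programmatically. The following is a minimal sketch, not part of the repository: the helper name `missing_drive_paths` and the `REQUIRED_PATHS` list are hypothetical, and only the entries the layout marks as non-optional are checked.

```python
from pathlib import Path

# Required entries, relative to MyDrive/, taken from the layout above.
# The optional maestro/ audio folder is deliberately not listed.
REQUIRED_PATHS = [
    "piano_eval_data/maestro_pseudo_labels_train.jsonl",
    "piano_eval_data/maestro_pseudo_labels_val.jsonl",
    "piano_eval_checkpoints",
]


def missing_drive_paths(drive_root):
    """Return the required relative paths that do not exist under drive_root."""
    root = Path(drive_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]


# In Colab, after mounting Drive, drive_root would be "/content/drive/MyDrive":
#   missing = missing_drive_paths("/content/drive/MyDrive")
#   print(missing or "Drive layout looks complete")
```

An empty list means every required file and folder is in place.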

## 1. Setup Environment

In [None]:
# Login to HF
import os
os.environ.pop("HF_TOKEN", None)
os.environ.pop("HUGGINGFACEHUB_API_TOKEN", None)
from huggingface_hub import login, HfApi
try:
    import getpass as gp
    raw = gp.getpass("Paste your Hugging Face token (input hidden): ")
    token = raw.decode() if isinstance(raw, (bytes, bytearray)) else raw
    if not isinstance(token, str):
        raise TypeError(f"Unexpected token type: {type(token).__name__}")
    token = token.strip()
    if not token:
        raise ValueError("Empty token provided")
    login(token=token, add_to_git_credential=False)
    who = HfApi().whoami(token=token)
    print(f"Logged in as: {who.get('name') or who.get('email') or 'OK'}")
except Exception as e:
    print(f"[HF Login] getpass flow failed: {e}")
    print("Falling back to interactive login widget...")
    login()
    try:
        who = HfApi().whoami()
        print(f"Logged in as: {who.get('name') or who.get('email') or 'OK'}")
    except Exception as e2:
        print(f"[HF Login] Verification skipped: {e2}")

In [None]:
# Mount Google Drive for data and checkpoints
from google.colab import drive
drive.mount('/content/drive')

# Verify data exists
import os
data_dir = '/content/drive/MyDrive/piano_eval_data'
checkpoint_dir = '/content/drive/MyDrive/piano_eval_checkpoints'

assert os.path.exists(data_dir), f"Data directory not found: {data_dir}"
assert os.path.exists(checkpoint_dir), f"Checkpoint directory not found: {checkpoint_dir}"

print(f"✓ Data directory: {data_dir}")
print(f"  Contents: {os.listdir(data_dir)}")
print(f"✓ Checkpoint directory: {checkpoint_dir}")

In [None]:
REPO_URL = "https://github.com/Jai-Dhiman/crescendai.git"
BRANCH = "main"

# Remove old clone if exists
!rm -rf /content/crescendai

# Clone fresh
!git clone --branch {BRANCH} {REPO_URL} /content/crescendai

# Navigate to model directory
%cd /content/crescendai/model

# Show git status
!git log -1 --oneline
!git status

In [None]:
# Install uv (fast Python package manager)
!curl -LsSf https://astral.sh/uv/install.sh | sh

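# Add uv to PATH for this session. Recent installer releases place the binary in
# ~/.local/bin and older ones in ~/.cargo/bin, so include both.
import os
home = os.environ['HOME']
os.environ['PATH'] = f"{home}/.local/bin:{home}/.cargo/bin:{os.environ['PATH']}"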

In [None]:
# Install dependencies directly into system Python (no venv needed in Colab)
# Using uv pip for faster installation
!uv pip install --system -e .

# Verify installation
import os
os.environ['MPLBACKEND'] = 'Agg'

import torch
import pytorch_lightning

print(f"Dependencies installed")
print(f"PyTorch: {torch.__version__}")
print(f"Lightning: {pytorch_lightning.__version__}")

## 2. Verify Setup

In [None]:
# Check GPU and PyTorch setup
import torch
import pytorch_lightning as pl

print(f"PyTorch version: {torch.__version__}")
print(f"Lightning version: {pl.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"\nGPU ready for training")
else:
    print("\nWARNING: No GPU detected! Go to Runtime > Change runtime type > T4 GPU")

In [None]:
# Test MERT model download (this will cache the model)
import torch  # needed for torch.cuda.empty_cache() below if this cell runs on its own
from transformers import AutoModel

print("Downloading MERT-95M model (one-time, ~380MB)...")
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
print(f"✓ MERT-95M loaded: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
del model  # Free memory
torch.cuda.empty_cache()

In [None]:
# Verify annotation files exist
import json

train_path = f'{data_dir}/maestro_pseudo_labels_train.jsonl'
val_path = f'{data_dir}/maestro_pseudo_labels_val.jsonl'

# Check files exist
assert os.path.exists(train_path), f"Train annotations not found: {train_path}"
assert os.path.exists(val_path), f"Val annotations not found: {val_path}"

# Count samples
with open(train_path, 'r') as f:
    train_count = sum(1 for line in f if line.strip())
with open(val_path, 'r') as f:
    val_count = sum(1 for line in f if line.strip())

print(f"✓ Train annotations: {train_count} segments")
print(f"✓ Val annotations: {val_count} segments")

# Show sample annotation
with open(train_path, 'r') as f:
    sample = json.loads(f.readline())
print(f"\nSample annotation:")
print(f"  Audio: {sample['audio_path']}")
print(f"  MIDI: {sample.get('midi_path', 'N/A')}")
print(f"  Duration: {sample.get('end_time', 0) - sample.get('start_time', 0):.1f}s")
print(f"  Labels: {list(sample['labels'].keys())}")
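
The cell above only inspects the first record. A fuller check could scan every line for the fields the notebook relies on (`audio_path`, `labels`, and the optional `start_time`/`end_time` pair). This is a hedged sketch: `validate_annotations` is a hypothetical helper, and the rule that `end_time` must exceed `start_time` is an assumption about the data rather than something the repository states.

```python
import json


def validate_annotations(lines, required_keys=("audio_path", "labels")):
    """Return (line_number, problem) pairs for records that fail basic checks."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, as in the counting code above
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        for key in required_keys:
            if key not in record:
                problems.append((lineno, f"missing key: {key}"))
        start, end = record.get("start_time"), record.get("end_time")
        # Assumption: a segment must have positive duration when both times exist.
        if start is not None and end is not None and end <= start:
            problems.append((lineno, "end_time is not after start_time"))
    return problems


# Usage with the paths defined above:
#   with open(train_path, "r") as f:
#       print(validate_annotations(f))
```

An empty result means every non-blank line parsed and carried the expected keys.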

## 3. Prepare Training Configuration

In [None]:
# Load base config and update paths for Colab
import yaml

config_path = 'configs/pseudo_pretrain.yaml'
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Update data paths to Google Drive
config['data']['train_path'] = f'{data_dir}/maestro_pseudo_labels_train.jsonl'
config['data']['val_path'] = f'{data_dir}/maestro_pseudo_labels_val.jsonl'
config['data']['test_path'] = None  # No test set for pseudo-label training

# Update checkpoint directory to Google Drive (for persistence)
config['callbacks']['checkpoint']['dirpath'] = f'{checkpoint_dir}/pseudo_pretrain'

# Update logging directory (local is fine, checkpoints are what matter)
config['logging']['tensorboard_logdir'] = 'logs/pseudo_pretrain'

# Optionally enable WandB for experiment tracking
config['logging']['use_wandb'] = False  # Set to True if you want WandB logging

# Save updated config
colab_config_path = 'configs/pseudo_pretrain_colab.yaml'
os.makedirs('configs', exist_ok=True)
with open(colab_config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)

print("✓ Config updated for Colab:")
print(f"  Train: {config['data']['train_path']}")
print(f"  Val: {config['data']['val_path']}")
print(f"  Checkpoints: {config['callbacks']['checkpoint']['dirpath']}")
print(f"  Max epochs: {config['training']['max_epochs']}")
print(f"  Batch size: {config['data']['batch_size']}")
print(f"  Dimensions: {config['data']['dimensions']}")

## 4. Train Model

This will take approximately **12 GPU hours** on a T4.

**Important**: 
- Checkpoints are saved to Google Drive every epoch
- If Colab disconnects, re-run the setup cells and then the training cell below; training will resume from the last checkpoint
- Early stopping will trigger if validation loss doesn't improve for 5 epochs
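
The early-stopping rule in the last bullet can be illustrated with a small, framework-free sketch. The patience of 5 comes from the text above; the function name `early_stop_epoch`, the `min_delta` parameter, and the assumption that plain validation loss is the monitored quantity are illustrative. The callback that `train.py` actually configures is not shown in this notebook.

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# The best loss (0.8) occurs at epoch 2; the next five epochs do not beat it.
losses = [1.0, 0.9, 0.8, 0.85, 0.81, 0.9, 0.82, 0.83, 0.7]
stop_at = early_stop_epoch(losses, patience=5)
```

In this example the final value of 0.7 is never reached, because the run has already stopped by then. That is the usual trade-off when choosing a patience value.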

In [None]:
# Test setup with fast dev run
!python train.py --config {colab_config_path} --fast-dev-run

In [None]:
# Run training
!python train.py --config {colab_config_path}

## 5. Evaluate Trained Model

In [None]:
# Load best checkpoint
import sys
sys.path.insert(0, '/content/crescendai/model')

from src.models.lightning_module import PerformanceEvaluationModel

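# Find best checkpoint
import re
checkpoint_path = f'{checkpoint_dir}/pseudo_pretrain'
checkpoints = [f for f in os.listdir(checkpoint_path) if f.endswith('.ckpt') and not f.startswith('last')]
assert checkpoints, f"No checkpoints found in {checkpoint_path}"

def _val_loss(name):
    # Assumes filenames embed "val_loss=<number>"; a plain name sort is not
    # guaranteed to order by loss (e.g. when the epoch number comes first).
    match = re.search(r'val_loss[=_]?(\d+(?:\.\d+)?)', name)
    return float(match.group(1)) if match else float('inf')

# Lowest parsed val_loss; falls back to the first name if nothing can be parsed
best_ckpt = min(sorted(checkpoints), key=_val_loss)
best_ckpt_path = os.path.join(checkpoint_path, best_ckpt)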

print(f"Loading best checkpoint: {best_ckpt}")
model = PerformanceEvaluationModel.load_from_checkpoint(best_ckpt_path)
model.eval()
model = model.cuda()

print(f"\nModel loaded successfully")
print(f"  Dimensions: {model.dimension_names}")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")

In [None]:
# Test inference on a validation sample
import torch
import json
from src.data.audio_processing import load_audio, normalize_audio
from src.data.midi_processing import load_midi, encode_octuple_midi, align_midi_to_audio
import numpy as np

# Load a random validation sample
val_path = f'{data_dir}/maestro_pseudo_labels_val.jsonl'
with open(val_path, 'r') as f:
    annotations = [json.loads(line) for line in f if line.strip()]
sample = annotations[0]  # First validation sample

print(f"Testing on: {sample['audio_path']}")
print(f"Segment: {sample.get('start_time', 0):.1f}s - {sample.get('end_time', 0):.1f}s")

# Load audio
audio, sr = load_audio(sample['audio_path'], sr=24000)

# Extract segment
if 'start_time' in sample and 'end_time' in sample:
    start_sample = int(sample['start_time'] * sr)
    end_sample = int(sample['end_time'] * sr)
    audio = audio[start_sample:end_sample]

# Normalize
audio = normalize_audio(audio)

# Pad/truncate to 10 seconds
max_length = 240000
if len(audio) > max_length:
    audio = audio[:max_length]
elif len(audio) < max_length:
    audio = np.pad(audio, (0, max_length - len(audio)))

# Convert to tensor
audio_tensor = torch.from_numpy(audio).float().unsqueeze(0).cuda()  # [1, samples]

# Forward pass
with torch.no_grad():
    output = model(audio_waveform=audio_tensor, midi_tokens=None)

# Extract predictions
scores = output['scores'][0].cpu().numpy()
# Flatten so that both [D] and [1, D] outputs pair one value with each dimension
uncertainties = output['uncertainties'].cpu().numpy().reshape(-1)
ground_truth = [sample['labels'][dim] for dim in model.dimension_names]

# Print results
print("\nPredictions vs Ground Truth (Pseudo-labels):")
print(f"{'Dimension':<25s} {'Predicted':<12s} {'Ground Truth':<12s} {'Error':<8s}")
print("-" * 65)
for dim_name, pred, gt, unc in zip(model.dimension_names, scores, ground_truth, uncertainties):
    error = abs(pred - gt)
    print(f"{dim_name:<25s} {pred:5.1f} ± {unc:4.2f}  {gt:5.1f}         {error:5.1f}")

mae = np.mean([abs(p - g) for p, g in zip(scores, ground_truth)])
print(f"\nMean Absolute Error: {mae:.2f}")
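
A single segment gives only a rough impression of model quality. The same comparison can be aggregated per dimension over many validation segments, as in the sketch below. The helper `per_dimension_mae` is hypothetical and works on plain dictionaries, so the inference loop that fills `predictions` is left to the reader. The dimension names in the example are placeholders, not the project's actual dimensions.

```python
def per_dimension_mae(predictions, targets):
    """Mean absolute error per dimension.

    predictions, targets: lists of {dimension_name: score} dicts, one per segment.
    Dimensions missing from a prediction are skipped for that segment.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    totals, counts = {}, {}
    for pred, target in zip(predictions, targets):
        for dim, gt in target.items():
            if dim not in pred:
                continue
            totals[dim] = totals.get(dim, 0.0) + abs(pred[dim] - gt)
            counts[dim] = counts.get(dim, 0) + 1
    return {dim: totals[dim] / counts[dim] for dim in totals}


# Placeholder dimension names and scores, for illustration only.
preds = [{"dim_a": 50.0, "dim_b": 60.0}, {"dim_a": 70.0, "dim_b": 40.0}]
truth = [{"dim_a": 55.0, "dim_b": 60.0}, {"dim_a": 60.0, "dim_b": 50.0}]
mae_by_dim = per_dimension_mae(preds, truth)
```

Because the targets are pseudo-labels, these errors measure agreement with the labeling heuristics rather than with expert judgment.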

## 6. Save Results & Next Steps

In [None]:
# Check final checkpoint sizes
checkpoint_path = f'{checkpoint_dir}/pseudo_pretrain'
total_size = 0
for f in os.listdir(checkpoint_path):
    if f.endswith('.ckpt'):
        size = os.path.getsize(os.path.join(checkpoint_path, f))
        total_size += size

print(f"✓ Training complete!")
print(f"\nCheckpoints saved to: {checkpoint_path}")
print(f"Total size: {total_size / 1e6:.1f} MB")
print(f"\nBest model: {best_ckpt}")
print(f"\nThis model is trained on pseudo-labels only.")

---

## Troubleshooting

### Session Disconnected
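- Re-run sections 1-3 (log in, mount Drive, clone the repo, install dependencies, regenerate the Colab config); a fresh runtime keeps none of these
- Re-run section 4 (training); it will automatically resume from the last checkpoint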
- All checkpoints are in Google Drive (persistent)
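
To confirm there is something to resume from before restarting training, one could look for the most recent `last*.ckpt` file in the checkpoint folder. This is a sketch under assumptions: the `last` prefix mirrors the filter used in the evaluation cell, `find_resume_checkpoint` is a hypothetical helper, and how `train.py` itself locates the checkpoint is not shown in this notebook.

```python
from pathlib import Path


def find_resume_checkpoint(checkpoint_dir):
    """Return the most recently modified 'last*.ckpt' file, or None if absent."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(
        directory.glob("last*.ckpt"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None


# Usage in Colab, with checkpoint_dir as defined in section 1:
#   print(find_resume_checkpoint(f"{checkpoint_dir}/pseudo_pretrain"))
```

A result of `None` means training would start from scratch.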

### Out of Memory (OOM)
- Reduce batch size in config: `config['data']['batch_size'] = 4`
- Increase gradient accumulation: `config['training']['accumulate_grad_batches'] = 8`
- This gives an effective batch size of 4 × 8 = 32; set both values in section 3 before the Colab config is saved
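
The arithmetic behind the batch-size bullets generalizes to other values. The sketch below uses two hypothetical helpers, `accumulation_steps` and `effective_batch_size`. The `num_devices` parameter is an assumption for multi-GPU runs and stays at 1 on a single Colab GPU.

```python
import math


def effective_batch_size(per_step_batch, accumulate_grad_batches, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_step_batch * accumulate_grad_batches * num_devices


def accumulation_steps(target_effective_batch, per_step_batch):
    """Smallest accumulation count that reaches at least the target batch size."""
    if per_step_batch <= 0:
        raise ValueError("per_step_batch must be positive")
    return math.ceil(target_effective_batch / per_step_batch)


# Reproduces the numbers above: batch size 4 with 8 accumulation steps.
steps = accumulation_steps(32, 4)
total = effective_batch_size(4, steps)
```

When the per-step batch size does not divide the target evenly, the helper rounds up, so the effective batch ends up slightly larger than the target.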

### Slow Training
- Check you have T4 or better GPU (not K80)
- Verify data is in Google Drive (not Colab Files)
- Check num_workers: `config['data']['num_workers'] = 2` (lower if I/O bottleneck)

### MERT Download Fails
- Verify HuggingFace authentication
- Check internet connection
- Try manual download: `huggingface-cli download m-a-p/MERT-v1-95M`
