# Training on Expanded Dataset (Real + Pseudo Labels)

**Goal**: Improve model performance by training on an expanded dataset that adds pseudo-labeled MAESTRO performances to the expert-labeled PercePiano data.

## Prerequisites

1. **Trained teacher model** from `train_percepiano_replica.ipynb` (R^2 >= 0.25)
2. **Pseudo-labeled MAESTRO** from `scripts/pseudo_label_maestro.py`

## Dataset Composition

| Source | Samples | Labels |
|--------|---------|--------|
| PercePiano (real) | 955 | Expert annotations |
| MAESTRO (pseudo) | ~5000+ | Teacher predictions |
| **Total** | ~6000+ | Mixed |

## Training Strategy

1. **Sample weighting**: Real labels weighted 1.0, pseudo labels weighted by confidence
2. **Confidence filtering**: Only include pseudo samples with confidence >= 0.5
3. **Validation on real only**: Evaluate on PercePiano val/test (real labels)
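
The weighting and filtering rules above can be sketched in plain Python. This is a minimal illustration, not the project's `create_mixed_dataloaders`: the helper names `build_training_pool` and `draw_batch`, the toy records, and the choice to multiply `pseudo_weight` by the teacher confidence are all assumptions made for the example.

```python
import random


def build_training_pool(real, pseudo, pseudo_weight=0.5, min_confidence=0.5):
    """Merge real and pseudo samples and attach a per-sample weight (hypothetical helper)."""
    pool = []
    for sample in real:
        # Expert-annotated samples always get full weight.
        pool.append({**sample, "is_pseudo_label": False, "sample_weight": 1.0})
    for sample in pseudo:
        # Assumption: a record without a confidence is treated as 0.0 and dropped.
        confidence = sample.get("confidence", 0.0)
        if confidence < min_confidence:
            continue  # confidence filtering
        # Assumption: weight = global pseudo weight scaled by teacher confidence.
        weight = pseudo_weight * confidence
        pool.append({**sample, "is_pseudo_label": True, "sample_weight": weight})
    return pool


def draw_batch(pool, batch_size, rng):
    """Sample with replacement, proportionally to sample_weight (hypothetical helper)."""
    weights = [s["sample_weight"] for s in pool]
    return rng.choices(pool, weights=weights, k=batch_size)


real = [{"id": "r1"}, {"id": "r2"}]
pseudo = [
    {"id": "p1", "confidence": 0.9},
    {"id": "p2", "confidence": 0.3},  # below the threshold, filtered out
    {"id": "p3", "confidence": 0.5},
]

pool = build_training_pool(real, pseudo)
batch = draw_batch(pool, batch_size=4, rng=random.Random(0))
for s in pool:
    print(s["id"], s["is_pseudo_label"], round(s["sample_weight"], 3))
```

Validation and test data never pass through this pool: they stay real-only, so reported metrics are not inflated by agreement with the teacher.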

## Expected Results

With expanded data, we expect:
- Reduced overfitting (more data)
- Potentially higher R^2 on test set
- Better generalization to unseen performances

## Step 1: Environment Setup

In [None]:
# Check GPU
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
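# Install rclone (the fallback message fires when no known status line is found in the installer output)
!curl -fsSL https://rclone.org/install.sh | sudo bash 2>&1 | grep -E "(successfully|already)" || echo "rclone install status unknown; verify with: rclone version"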

In [None]:
# Install uv and clone repository
!curl -LsSf https://astral.sh/uv/install.sh | sh

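import os
# uv's installer targets ~/.local/bin on recent releases and ~/.cargo/bin on older ones
home = os.environ['HOME']
os.environ['PATH'] = f"{home}/.local/bin:{home}/.cargo/bin:{os.environ['PATH']}"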

if not os.path.exists('/tmp/crescendai'):
    !git clone https://github.com/Jai-Dhiman/crescendai.git /tmp/crescendai

%cd /tmp/crescendai/model
!git pull
!git log -1 --oneline

!uv pip install --system -e .
!pip install tensorboard rich

import pytorch_lightning as pl
print(f"\nPyTorch: {torch.__version__}")
print(f"Lightning: {pl.__version__}")

In [None]:
import os
from pathlib import Path
import subprocess

# Paths
CHECKPOINT_ROOT = '/tmp/checkpoints/expanded_training'
GDRIVE_CHECKPOINT_PATH = 'gdrive:crescendai_checkpoints/expanded_training'
GDRIVE_DATA_PATH = 'gdrive:percepiano_data'
GDRIVE_PSEUDO_PATH = 'gdrive:crescendai_checkpoints/pseudo_labels'
DATA_ROOT = Path('/tmp/percepiano_data')
PSEUDO_ROOT = Path('/tmp/pseudo_labels')

print("="*70)
print("EXPANDED DATASET TRAINING")
print("="*70)
print("Training on: PercePiano (real) + MAESTRO (pseudo)")
print("="*70)

# Create directories
os.makedirs(CHECKPOINT_ROOT, exist_ok=True)
DATA_ROOT.mkdir(parents=True, exist_ok=True)
PSEUDO_ROOT.mkdir(parents=True, exist_ok=True)

# Check rclone
result = subprocess.run(['rclone', 'listremotes'], capture_output=True, text=True)
RCLONE_AVAILABLE = 'gdrive:' in result.stdout
print(f"rclone available: {RCLONE_AVAILABLE}")

## Step 2: Download Data

In [None]:
import subprocess
import json

# Download PercePiano data
if not (DATA_ROOT / 'percepiano_train.json').exists():
    print("Downloading PercePiano data...")
    # check=True fails loudly here because the real labels are required downstream
    subprocess.run(['rclone', 'copy', GDRIVE_DATA_PATH, str(DATA_ROOT), '--progress'], check=True)
else:
    print("PercePiano data already exists")

# Download pseudo labels
pseudo_file = PSEUDO_ROOT / 'maestro_pseudo_train.json'
if not pseudo_file.exists():
    print("\nDownloading pseudo labels...")
    subprocess.run(['rclone', 'copy', GDRIVE_PSEUDO_PATH, str(PSEUDO_ROOT), '--progress'])
else:
    print("Pseudo labels already exist")

# Verify data
print("\n" + "="*60)
print("DATA VERIFICATION")
print("="*60)

# Real labels
for split in ['train', 'val', 'test']:
    path = DATA_ROOT / f'percepiano_{split}.json'
    if path.exists():
        with open(path) as f:
            data = json.load(f)
        print(f"PercePiano {split}: {len(data)} samples")

# Pseudo labels
if pseudo_file.exists():
    with open(pseudo_file) as f:
        pseudo_data = json.load(f)
    print(f"\nMAESTRO pseudo: {len(pseudo_data)} samples")
    
    # Confidence distribution (records without a 'confidence' field default to 0.5)
    confidences = [s.get('confidence', 0.5) for s in pseudo_data]
    if confidences:  # guard against an empty pseudo-label file
        print(f"  Mean confidence: {sum(confidences)/len(confidences):.3f}")
        print(f"  Min confidence: {min(confidences):.3f}")
        print(f"  Max confidence: {max(confidences):.3f}")
else:
    print("\nWARNING: Pseudo labels not found!")
    print("Run pseudo_label_maestro.py first, or continue with real labels only")
    pseudo_file = None

In [None]:
# Update paths for Thunder Compute
import json
from pathlib import Path

MIDI_DIR = DATA_ROOT / 'PercePiano' / 'virtuoso' / 'data' / 'all_2rounds'
SCORE_DIR = DATA_ROOT / 'PercePiano' / 'virtuoso' / 'data' / 'score_xml'

PERCEPIANO_DIMENSIONS = [
    "timing", "articulation_length", "articulation_touch",
    "pedal_amount", "pedal_clarity", "timbre_variety",
    "timbre_depth", "timbre_brightness", "timbre_loudness",
    "dynamic_range", "tempo", "space", "balance", "drama",
    "mood_valence", "mood_energy", "mood_imagination",
    "sophistication", "interpretation",
]

for split in ['train', 'val', 'test']:
    path = DATA_ROOT / f'percepiano_{split}.json'
    with open(path) as f:
        data = json.load(f)
    
    for sample in data:
        filename = Path(sample['midi_path']).name
        sample['midi_path'] = str(MIDI_DIR / filename)
        
        if 'percepiano_scores' in sample:
            # Keep one score per perceptual dimension (19 in total)
            pp_scores = sample['percepiano_scores'][:len(PERCEPIANO_DIMENSIONS)]
            sample['scores'] = {
                dim: pp_scores[i] for i, dim in enumerate(PERCEPIANO_DIMENSIONS)
            }
    
    with open(path, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Updated {split}")

print("\nPaths updated for Thunder Compute")

## Step 3: Training Configuration

In [None]:
import torch
torch.set_float32_matmul_precision('medium')

CONFIG = {
    # Data
    'data_dir': str(DATA_ROOT),
    'score_dir': str(SCORE_DIR),
    'pseudo_data_path': str(pseudo_file) if pseudo_file else None,
    
    # Pseudo-label settings
    'pseudo_weight': 0.5,           # Weight for pseudo labels (real = 1.0)
    'min_pseudo_confidence': 0.5,   # Minimum confidence to include
    'use_weighted_sampling': True,  # Sample by weight
    
    # Model (PercePiano replica architecture)
    'hidden_size': 256,
    'note_layers': 2,
    'voice_layers': 2,
    'beat_layers': 2,
    'measure_layers': 1,
    'num_attention_heads': 8,
    'final_hidden': 128,
    
    # Training
    'learning_rate': 2.5e-5,
    'weight_decay': 0.01,
    'dropout': 0.2,
    'batch_size': 8,
    'max_epochs': 100,
    'early_stopping_patience': 20,
    'gradient_clip_val': 1.0,
    'precision': '16-mixed',
    
    # Sequence limits
    'max_score_notes': 1024,
    'max_midi_seq_length': 512,
    
    # Checkpoints
    'checkpoint_dir': CHECKPOINT_ROOT,
    'gdrive_checkpoint': GDRIVE_CHECKPOINT_PATH,
}

print("="*70)
print("EXPANDED TRAINING CONFIGURATION")
print("="*70)
for k, v in CONFIG.items():
    print(f"  {k}: {v}")
print("="*70)

## Step 4: Create Mixed DataLoaders

In [None]:
from pathlib import Path
from src.data.mixed_dataset import create_mixed_dataloaders

train_loader, val_loader, test_loader = create_mixed_dataloaders(
    real_data_dir=Path(CONFIG['data_dir']),
    pseudo_data_path=Path(CONFIG['pseudo_data_path']) if CONFIG['pseudo_data_path'] else None,
    score_dir=Path(CONFIG['score_dir']),
    batch_size=CONFIG['batch_size'],
    max_midi_seq_length=CONFIG['max_midi_seq_length'],
    max_score_notes=CONFIG['max_score_notes'],
    pseudo_weight=CONFIG['pseudo_weight'],
    min_pseudo_confidence=CONFIG['min_pseudo_confidence'],
    num_workers=4,
    use_weighted_sampling=CONFIG['use_weighted_sampling'],
)

print(f"\nDataLoader sizes:")
print(f"  Train: {len(train_loader)} batches")
print(f"  Val: {len(val_loader)} batches")
print(f"  Test: {len(test_loader)} batches")

# Test batch
batch = next(iter(train_loader))
print(f"\nBatch contents:")
print(f"  score_note_features: {batch['score_note_features'].shape}")
print(f"  scores: {batch['scores'].shape}")
print(f"  is_pseudo_label: {batch['is_pseudo_label'].sum().item()}/{len(batch['is_pseudo_label'])} pseudo")
print(f"  sample_weights: {batch['sample_weight'].mean().item():.3f} (mean)")

## Step 5: Create Model

In [None]:
from src.models.percepiano_replica import PercePianoReplicaModule

model = PercePianoReplicaModule(
    score_note_features=20,
    score_global_features=12,
    hidden_size=CONFIG['hidden_size'],
    note_layers=CONFIG['note_layers'],
    voice_layers=CONFIG['voice_layers'],
    beat_layers=CONFIG['beat_layers'],
    measure_layers=CONFIG['measure_layers'],
    num_attention_heads=CONFIG['num_attention_heads'],
    final_hidden=CONFIG['final_hidden'],
    learning_rate=CONFIG['learning_rate'],
    weight_decay=CONFIG['weight_decay'],
    dropout=CONFIG['dropout'],
)

print("="*70)
print("MODEL SUMMARY")
print("="*70)
print(f"Architecture: PercePiano Replica (Bi-LSTM + HAN)")
print(f"Parameters: {model.count_parameters():,}")
print(f"Dimensions: {len(model.dimensions)}")
print("="*70)

## Step 6: Configure Trainer

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger

checkpoint_callback = ModelCheckpoint(
    dirpath=CONFIG['checkpoint_dir'],
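    # The monitored metric is named 'val/mean_r2', so reference it by that name;
    # auto_insert_metric_name=False keeps the slash out of the file path
    filename='expanded-{epoch:02d}-{val/mean_r2:.4f}',
    auto_insert_metric_name=False,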
    monitor='val/mean_r2',
    mode='max',
    save_top_k=3,
    save_last=True,
)

early_stopping = EarlyStopping(
    monitor='val/mean_r2',
    patience=CONFIG['early_stopping_patience'],
    mode='max',
)

lr_monitor = LearningRateMonitor(logging_interval='step')

logger = TensorBoardLogger(
    save_dir='/tmp/logs',
    name='expanded_training',
)

trainer = pl.Trainer(
    max_epochs=CONFIG['max_epochs'],
    accelerator='gpu',
    devices=1,
    precision=CONFIG['precision'],
    gradient_clip_val=CONFIG['gradient_clip_val'],
    callbacks=[checkpoint_callback, early_stopping, lr_monitor],
    logger=logger,
    log_every_n_steps=10,
    val_check_interval=0.5,
)

print("Trainer configured!")

## Step 7: Train!
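
Each batch carries a `sample_weight` entry, which presumably scales every sample's contribution to the loss. The module's actual loss is not shown in this notebook, so the following is only a sketch of one common choice, a weighted mean of per-sample MSE. `weighted_mse` is a hypothetical helper, and the toy numbers are made up for illustration.

```python
def weighted_mse(predictions, targets, sample_weights):
    """Weighted mean of per-sample MSE.

    predictions, targets: one list per sample, one value per dimension.
    sample_weights: one non-negative weight per sample.
    """
    total = 0.0
    weight_sum = 0.0
    for pred, target, weight in zip(predictions, targets, sample_weights):
        # Mean squared error over the dimensions of a single sample.
        per_sample = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        total += weight * per_sample
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("sample weights sum to zero")
    # Normalizing by the weight sum keeps the loss scale comparable across batches.
    return total / weight_sum


preds = [[0.5, 0.5], [1.0, 0.0]]
targets = [[0.5, 0.5], [0.0, 0.0]]

# First sample is exact; the second has a per-sample MSE of 0.5.
loss_equal = weighted_mse(preds, targets, [1.0, 1.0])
# Treating the second sample as a pseudo label with weight 0.5 shrinks its influence.
loss_down = weighted_mse(preds, targets, [1.0, 0.5])
print(loss_equal, loss_down)
```

If weighted sampling and a weighted loss are both active, pseudo-labeled samples are down-weighted twice, which may or may not be intended.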

In [None]:
pl.seed_everything(42, workers=True)

print("="*70)
print("TRAINING ON EXPANDED DATASET")
print("="*70)
print(f"Real samples: {len(train_loader.dataset.real_samples)}")
print(f"Pseudo samples: {len(train_loader.dataset.pseudo_samples)}")
print(f"Total: {len(train_loader.dataset)}")
print("="*70)

trainer.fit(model, train_loader, val_loader)

In [None]:
# Sync checkpoints
if RCLONE_AVAILABLE:
    print("Syncing checkpoints to Google Drive...")
    subprocess.run(['rclone', 'copy', CONFIG['checkpoint_dir'], CONFIG['gdrive_checkpoint'], '--progress'])

## Step 8: Evaluation
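
The headline metric is the mean R^2 across the perceptual dimensions. As a reference for reading the numbers below, here is a minimal sketch that assumes R^2 is computed per dimension as 1 - SS_res / SS_tot and then averaged without weighting. The project's `compute_all_metrics` may differ in details; `r2_score` and `mean_r2` here are illustrative names, not the project's API.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for a single dimension."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        return float("nan")  # undefined when the targets are constant
    return 1.0 - ss_res / ss_tot


def mean_r2(targets, predictions):
    """targets, predictions: rows are samples, columns are dimensions."""
    n_dims = len(targets[0])
    per_dim = [
        r2_score([row[d] for row in targets], [row[d] for row in predictions])
        for d in range(n_dims)
    ]
    return sum(per_dim) / n_dims, per_dim


targets = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
# Dimension 0 is predicted exactly; dimension 1 always predicts the target mean.
preds = [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]

overall, per_dim = mean_r2(targets, preds)
print(per_dim, overall)
```

An R^2 of 0 means the model does no better than predicting the mean label for that dimension, and negative values mean it does worse.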

In [None]:
# Test with best checkpoint
best_path = checkpoint_callback.best_model_path
print(f"Best checkpoint: {best_path}")

if best_path:
    test_results = trainer.test(model, test_loader, ckpt_path=best_path)
    print("\nTest Results:")
    for k, v in test_results[0].items():
        print(f"  {k}: {v:.4f}")

In [None]:
import torch
import numpy as np
from src.models.percepiano_replica import PercePianoReplicaModule

# Load best model
best_model = PercePianoReplicaModule.load_from_checkpoint(best_path)
best_model.eval()
best_model.cuda()

# Collect predictions
all_preds = []
all_targets = []

with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in batch.items()}
        note_locations = {
            'beat': batch['note_locations_beat'],
            'measure': batch['note_locations_measure'],
            'voice': batch['note_locations_voice'],
        }
        outputs = best_model(
            batch['score_note_features'],
            batch['score_global_features'],
            batch['score_tempo_curve'],
            note_locations,
        )
        all_preds.append(outputs['predictions'].cpu())
        all_targets.append(batch['scores'].cpu())

all_preds = torch.cat(all_preds).numpy()
all_targets = torch.cat(all_targets).numpy()
dimensions = best_model.dimensions

print(f"Collected {len(all_preds)} test samples")

In [None]:
from src.evaluation import (
    compute_all_metrics,
    compare_to_sota,
    format_comparison_table,
)

metrics = compute_all_metrics(
    predictions=all_preds,
    targets=all_targets,
    dimension_names=list(dimensions),
)

our_r2 = metrics['r2'].value
per_dim_r2 = metrics['r2'].per_dimension

comparison = compare_to_sota(
    model_r2=our_r2,
    model_name="Expanded Training (Real + Pseudo)",
    split_type="piece",
    per_dimension_r2=per_dim_r2,
)

print(format_comparison_table(comparison))

In [None]:
# Summary
print("="*70)
print("EXPANDED TRAINING - RESULTS SUMMARY")
print("="*70)

print(f"\n1. DATASET")
print(f"   Real samples (PercePiano): {len(train_loader.dataset.real_samples)}")
print(f"   Pseudo samples (MAESTRO): {len(train_loader.dataset.pseudo_samples)}")
print(f"   Expansion factor: {len(train_loader.dataset) / len(train_loader.dataset.real_samples):.1f}x")

print(f"\n2. PERFORMANCE")
print(f"   Mean R^2: {our_r2:.4f}")
print(f"   Target (0.35-0.40): {'ACHIEVED' if our_r2 >= 0.35 else 'CLOSE' if our_r2 >= 0.30 else 'NOT YET'}")

print(f"\n3. COMPARISON")
print(f"   Teacher model R^2: {train_loader.dataset.pseudo_samples[0].get('teacher_r2', 'N/A') if train_loader.dataset.pseudo_samples else 'N/A'}")
print(f"   Expanded model R^2: {our_r2:.4f}")
print(f"   Gap vs. target (0.35): {our_r2 - 0.35:+.4f}")

print(f"\n4. TOP 5 DIMENSIONS")
sorted_dims = sorted(per_dim_r2.items(), key=lambda x: x[1], reverse=True)
for dim, r2 in sorted_dims[:5]:
    print(f"   {dim}: {r2:.4f}")

print("="*70)

## Step 9: Save Final Model

In [None]:
import torch

# Save final model
final_path = Path(CONFIG['checkpoint_dir']) / 'expanded_final.pt'

torch.save({
    'state_dict': best_model.state_dict(),
    'hparams': dict(best_model.hparams),
    'dimensions': list(dimensions),
    'metrics': {
        'r2': our_r2,
        'per_dimension_r2': per_dim_r2,
    },
    'training_info': {
        'real_samples': len(train_loader.dataset.real_samples),
        'pseudo_samples': len(train_loader.dataset.pseudo_samples),
        'pseudo_weight': CONFIG['pseudo_weight'],
        'min_pseudo_confidence': CONFIG['min_pseudo_confidence'],
    },
}, final_path)

print(f"Saved final model to {final_path}")

# Sync to GDrive
if RCLONE_AVAILABLE:
    subprocess.run(['rclone', 'copy', CONFIG['checkpoint_dir'], CONFIG['gdrive_checkpoint'], '--progress'])
    print(f"Synced to {CONFIG['gdrive_checkpoint']}")

## Next Steps

If R^2 improved:
1. **Iterate**: Use this model as new teacher, pseudo-label more data
2. **Noisy Student**: Add augmentation for student training
3. **Scale up**: Try larger model now that we have more data

If R^2 didn't improve:
1. Reduce pseudo_weight (try 0.3)
2. Increase min_pseudo_confidence (try 0.7)
3. Check if teacher model was good enough (R^2 >= 0.25)
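
The first two adjustments can be explored together as a small grid. The sketch below only generates the configurations; `sweep_configs` is a hypothetical helper, the candidate values are examples, and each resulting dictionary would still need a full training run with the cells above.

```python
from itertools import product


def sweep_configs(base, pseudo_weights, min_confidences):
    """Yield one config per (pseudo_weight, min_pseudo_confidence) pair."""
    for weight, confidence in product(pseudo_weights, min_confidences):
        # Copy the base config so the original dictionary stays untouched.
        yield {**base, "pseudo_weight": weight, "min_pseudo_confidence": confidence}


base = {"pseudo_weight": 0.5, "min_pseudo_confidence": 0.5, "learning_rate": 2.5e-5}
runs = list(sweep_configs(base, pseudo_weights=[0.3, 0.5], min_confidences=[0.5, 0.7]))

for run in runs:
    print(run["pseudo_weight"], run["min_pseudo_confidence"])
```

Comparing runs on the real-only validation split keeps the comparison fair, since every configuration is scored against the same expert labels.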

---

**Attribution**: Based on PercePiano (Park et al., ISMIR/Nature 2024)