# Baseline Model Validation - Synthetic Labels

This notebook trains and evaluates baseline models to validate:
1. Can the architecture learn patterns from synthetic labels?
2. Does audio+MIDI outperform audio-only?
3. How do different data sources (MAESTRO vs YouTube) compare?

**Target**: Pearson r = 0.48-0.55 on technical dimensions OR evidence of clear learning
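
A correlation measured on a small test set carries real uncertainty, which matters when comparing it against the target range above. The sketch below uses Fisher's z-transform to approximate a 95% confidence interval for a single Pearson r. The sample size `n = 96` is illustrative only (borrowed from the minimum YouTube file count in Section 3); the real test-set size may differ, and the approximation assumes roughly bivariate-normal data.

```python
import math

def pearson_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z-transform."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Illustrative values: r at the lower target bound, assumed n = 96
ci_low, ci_high = pearson_ci(0.48, 96)
print(f"r = 0.48, n = 96 -> approx. 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

With a test set of this size, a measured r near a threshold may not reliably separate adjacent tiers, so per-dimension results are best read alongside their intervals.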

**Requirements:**
- Colab Pro (T4/V100 GPU recommended)
- Google Drive for data and checkpoints
- HuggingFace account for MERT model
- Git repository access

## 1. Environment Setup

In [None]:
# Login to Hugging Face
import os
os.environ.pop("HF_TOKEN", None)
os.environ.pop("HUGGINGFACEHUB_API_TOKEN", None)
from huggingface_hub import login, HfApi

try:
    import getpass as gp
    raw = gp.getpass("Paste your Hugging Face token (input hidden): ")
    token = raw.decode() if isinstance(raw, (bytes, bytearray)) else raw
    if not isinstance(token, str):
        raise TypeError(f"Unexpected token type: {type(token).__name__}")
    token = token.strip()
    if not token:
        raise ValueError("Empty token provided")
    login(token=token, add_to_git_credential=False)
    who = HfApi().whoami(token=token)
    print(f"✓ Logged in as: {who.get('name') or who.get('email') or 'OK'}")
except Exception as e:
    print(f"[HF Login] getpass flow failed: {e}")
    print("Falling back to interactive login widget...")
    login()
    try:
        who = HfApi().whoami()
        print(f"✓ Logged in as: {who.get('name') or who.get('email') or 'OK'}")
    except Exception as e2:
        print(f"[HF Login] Verification skipped: {e2}")

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Setup paths
data_dir = '/content/drive/MyDrive/piano_eval_data'
checkpoint_dir = '/content/drive/MyDrive/piano_eval_checkpoints'

print(f"✓ Data directory: {data_dir}")
print(f"✓ Checkpoint directory: {checkpoint_dir}")

In [None]:
# Clone repository
REPO_URL = "https://github.com/Jai-Dhiman/crescendai.git"
BRANCH = "main"

!rm -rf /content/crescendai
!git clone --branch {BRANCH} {REPO_URL} /content/crescendai
%cd /content/crescendai/model

!git log -1 --oneline
print(f"\n✓ Repository cloned")

In [None]:
# Install uv and dependencies
!curl -LsSf https://astral.sh/uv/install.sh | sh

import os
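# The uv installer's target directory depends on the uv version
# (~/.local/bin for newer releases, ~/.cargo/bin for older ones), so add both.
os.environ['PATH'] = f"{os.environ['HOME']}/.local/bin:{os.environ['HOME']}/.cargo/bin:{os.environ['PATH']}"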

!uv pip install --system -e .

import torch
import pytorch_lightning
print(f"\n✓ Dependencies installed")
print(f"  PyTorch: {torch.__version__}")
print(f"  Lightning: {pytorch_lightning.__version__}")

## 2. Verify GPU Setup

In [None]:
import torch

print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"\n✓ GPU ready for training")
else:
    print("\n⚠️ WARNING: NO GPU DETECTED!")
    print("Enable GPU: Runtime → Change runtime type → T4 GPU")
    raise RuntimeError("GPU required for training")

## 3. Verify Data Files

**Upload these files to Google Drive before running:**

To `{data_dir}/annotations/`:
- `synthetic_train.jsonl`
- `synthetic_val.jsonl`
- `synthetic_test.jsonl`

To `{data_dir}/all_segments/`:
- `maestro_001.wav` through `maestro_100.wav`
- `youtube_*.wav` files (at least 96 files for test set)
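
Beyond checking that the files exist, it can help to confirm that every annotation record points at an audio segment that was actually uploaded. The following is a minimal sketch under a stated assumption: the field name `audio_path` is hypothetical, since the real JSONL schema is not shown in this notebook, so adjust `audio_key` to match it. The demo writes throwaway files to a temporary directory rather than touching Google Drive.

```python
import json
import os
import tempfile

def check_annotations(jsonl_path, segments_dir, audio_key="audio_path"):
    """Return (record_count, missing) for one annotation file.

    audio_key is a hypothetical field name; change it to the real schema.
    missing is a list of (line_number, file_name) for absent audio files.
    """
    count, missing = 0, []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            count += 1
            name = os.path.basename(record[audio_key])
            if not os.path.exists(os.path.join(segments_dir, name)):
                missing.append((line_no, name))
    return count, missing

# Tiny self-contained demo with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    segments = os.path.join(tmp, "all_segments")
    os.makedirs(segments)
    open(os.path.join(segments, "maestro_001.wav"), "wb").close()
    jsonl = os.path.join(tmp, "synthetic_test.jsonl")
    with open(jsonl, "w", encoding="utf-8") as f:
        f.write(json.dumps({"audio_path": "maestro_001.wav"}) + "\n")
        f.write(json.dumps({"audio_path": "youtube_042.wav"}) + "\n")
    demo_count, demo_missing = check_annotations(jsonl, segments)

print(demo_count, demo_missing)
```

In the notebook itself, the same function could be called with `f'{data_dir}/annotations/synthetic_test.jsonl'` and `f'{data_dir}/all_segments'`.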

In [None]:
# Verify data exists
import os

annotations_dir = f'{data_dir}/annotations'
segments_dir = f'{data_dir}/all_segments'

# Check annotations
assert os.path.exists(f'{annotations_dir}/synthetic_train.jsonl'), "synthetic_train.jsonl not found!"
assert os.path.exists(f'{annotations_dir}/synthetic_val.jsonl'), "synthetic_val.jsonl not found!"
assert os.path.exists(f'{annotations_dir}/synthetic_test.jsonl'), "synthetic_test.jsonl not found!"

print(f"✓ Annotation files found")

# Count audio files
maestro_files = len([f for f in os.listdir(segments_dir) if f.startswith('maestro_')])
youtube_files = len([f for f in os.listdir(segments_dir) if f.startswith('youtube_')])

print(f"✓ Audio segments found:")
print(f"  MAESTRO: {maestro_files}")
print(f"  YouTube: {youtube_files}")

if maestro_files < 100:
    print(f"\n⚠️ Warning: Expected 100 MAESTRO files, found {maestro_files}")
if youtube_files < 96:
    print(f"\n⚠️ Warning: Expected at least 96 YouTube files, found {youtube_files}")

## 4. Train Audio-Only Baseline (Experiment A)

**Start here**: Train the audio-only model first to validate the approach.

This takes ~2-3 GPU hours on T4.
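
The config cells in this section and in Section 5 apply the same five path overrides to two different YAML files. A small helper can express that once. The sketch below works on plain dictionaries (as returned by `yaml.safe_load`); the helper names `apply_overrides` and `colab_overrides` are hypothetical, and the `base` values are made-up placeholders, while the override keys are the ones used in this notebook.

```python
import copy

def apply_overrides(config, overrides):
    """Return a deep copy of config with dotted-key overrides applied."""
    updated = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for key in parents:
            node = node.setdefault(key, {})  # create missing sections
        node[leaf] = value
    return updated

def colab_overrides(annotations_dir, checkpoint_dir, run_name):
    """Build the path overrides this notebook applies for one experiment."""
    return {
        "data.train_path": f"{annotations_dir}/synthetic_train.jsonl",
        "data.val_path": f"{annotations_dir}/synthetic_val.jsonl",
        "data.test_path": f"{annotations_dir}/synthetic_test.jsonl",
        "callbacks.checkpoint.dirpath": f"{checkpoint_dir}/{run_name}",
        "logging.tensorboard_logdir": f"{checkpoint_dir}/logs/{run_name}",
    }

# Placeholder config standing in for a loaded YAML file
base = {
    "data": {"train_path": "data/train.jsonl", "batch_size": 8},
    "callbacks": {"checkpoint": {"dirpath": "checkpoints/"}},
}
colab_config = apply_overrides(
    base,
    colab_overrides(
        "/content/drive/MyDrive/piano_eval_data/annotations",
        "/content/drive/MyDrive/piano_eval_checkpoints",
        "baseline_audioonly",
    ),
)
print(colab_config["callbacks"]["checkpoint"]["dirpath"])
```

Returning a copy leaves the loaded config untouched, so the same base can be reused for the audio+MIDI run with a different `run_name`.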

In [None]:
# Update audio-only config with Google Drive paths
audioonly_config_path = '/content/crescendai/model/configs/baseline_audioonly.yaml'

import yaml
with open(audioonly_config_path, 'r') as f:
    config_audioonly = yaml.safe_load(f)

# Update paths
config_audioonly['data']['train_path'] = f'{annotations_dir}/synthetic_train.jsonl'
config_audioonly['data']['val_path'] = f'{annotations_dir}/synthetic_val.jsonl'
config_audioonly['data']['test_path'] = f'{annotations_dir}/synthetic_test.jsonl'
config_audioonly['callbacks']['checkpoint']['dirpath'] = f'{checkpoint_dir}/baseline_audioonly'
config_audioonly['logging']['tensorboard_logdir'] = f'{checkpoint_dir}/logs/baseline_audioonly'

# Save updated config
colab_audioonly_path = '/tmp/baseline_audioonly_colab.yaml'
with open(colab_audioonly_path, 'w') as f:
    yaml.dump(config_audioonly, f, default_flow_style=False)

print(f"✓ Audio-only configuration updated")
print(f"\nExperiment: Audio-Only Baseline")
print(f"  Epochs: {config_audioonly['training']['max_epochs']}")
print(f"  MIDI: Disabled")
print(f"  Fusion: Disabled")

In [None]:
# Train audio-only model
print("="*80)
print("TRAINING AUDIO-ONLY MODEL (EXPERIMENT A)")
print("="*80)
print(f"Expected duration: ~2-3 GPU hours on T4\n")

!python train.py --config {colab_audioonly_path}

print("\n" + "="*80)
print("✓ Audio-only training complete!")
print("="*80)

## 5. Train Audio+MIDI Baseline (Experiment B - Optional)

**Only run this if:**
1. Audio-only training showed promising results (run the Section 6 evaluation first to check)
2. You have MIDI files uploaded to Google Drive

This takes ~3-4 GPU hours on T4.

**Skip this section if you don't have MIDI files.**

In [None]:
# Update audio+MIDI config with Google Drive paths
import yaml  # repeated so this cell also runs if the Section 4 config cell was skipped

config_path = '/content/crescendai/model/configs/baseline_synthetic.yaml'

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Update paths
config['data']['train_path'] = f'{annotations_dir}/synthetic_train.jsonl'
config['data']['val_path'] = f'{annotations_dir}/synthetic_val.jsonl'
config['data']['test_path'] = f'{annotations_dir}/synthetic_test.jsonl'
config['callbacks']['checkpoint']['dirpath'] = f'{checkpoint_dir}/baseline_synthetic'
config['logging']['tensorboard_logdir'] = f'{checkpoint_dir}/logs/baseline_synthetic'

# Save updated config
colab_config_path = '/tmp/baseline_synthetic_colab.yaml'
with open(colab_config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False)

print(f"✓ Configuration updated for Colab")
print(f"\nExperiment: Audio+MIDI Baseline")
print(f"  Epochs: {config['training']['max_epochs']}")
print(f"  Batch size: {config['data']['batch_size']}")
print(f"  Dimensions: {config['data']['dimensions']}")

In [None]:
# Train audio+MIDI model
print("="*80)
print("TRAINING AUDIO+MIDI MODEL (EXPERIMENT B)")
print("="*80)
print(f"Expected duration: ~3-4 GPU hours on T4\n")

!python train.py --config {colab_config_path}

print("\n" + "="*80)
print("✓ Audio+MIDI training complete!")
print("="*80)

## 6. Evaluation - Audio-Only Model
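
The loading cell in this section picks a checkpoint by filename order, which matches the best checkpoint only if the names happen to sort that way. If the checkpoint filenames embed the monitored metric, the choice can be made explicitly. The sketch below assumes names of the form `epoch=07-val_loss=0.29.ckpt`; that pattern and the metric name `val_loss` are assumptions, since the repository's checkpoint naming is not shown here.

```python
import re

def pick_checkpoint(names, metric="val_loss", mode="min"):
    """Pick the checkpoint whose filename reports the best metric value.

    Assumes names like 'epoch=07-val_loss=0.29.ckpt' (hypothetical format).
    """
    pattern = re.compile(rf"{re.escape(metric)}=(-?\d+(?:\.\d+)?)")
    scored = []
    for name in names:
        if not name.endswith(".ckpt") or "last" in name:
            continue  # skip non-checkpoints and the rolling 'last' file
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(1)), name))
    if not scored:
        raise ValueError(f"No checkpoint name contains '{metric}=<value>'")
    chooser = min if mode == "min" else max
    return chooser(scored)[1]

# Made-up filenames for illustration
names = [
    "epoch=10-val_loss=0.41.ckpt",
    "epoch=02-val_loss=0.35.ckpt",
    "epoch=07-val_loss=0.29.ckpt",
    "last.ckpt",
]
best = pick_checkpoint(names)
first_alphabetical = sorted(n for n in names if "last" not in n)[0]
print(best, "|", first_alphabetical)
```

For a metric where higher is better (for example a validation correlation), pass `mode="max"`.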

In [None]:
# Load audio-only model
import sys
sys.path.insert(0, '/content/crescendai/model')

import torch
import pytorch_lightning as pl
from src.models.lightning_module import PerformanceEvaluationModel
import pandas as pd
import numpy as np
from scipy.stats import pearsonr

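def get_best_checkpoint(checkpoint_dir):
    """Return the path of a non-'last' checkpoint in checkpoint_dir."""
    checkpoints = [f for f in os.listdir(checkpoint_dir) if f.endswith('.ckpt') and 'last' not in f]
    if not checkpoints:
        raise ValueError(f"No checkpoints found in {checkpoint_dir}")
    # NOTE: this takes the alphabetically first file, which is the best checkpoint only
    # if a single checkpoint is kept or the filenames sort by the monitored metric.
    return os.path.join(checkpoint_dir, sorted(checkpoints)[0])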

audioonly_ckpt = get_best_checkpoint(f'{checkpoint_dir}/baseline_audioonly')

print(f"Loading audio-only model...")
print(f"  Checkpoint: {os.path.basename(audioonly_ckpt)}")

model_audioonly = PerformanceEvaluationModel.load_from_checkpoint(audioonly_ckpt)
model_audioonly.eval().cuda()

print(f"\n✓ Model loaded")

In [None]:
# Create test dataloader
from src.data.dataset import PerformanceDataset, collate_fn
from torch.utils.data import DataLoader

test_dataset = PerformanceDataset(
    annotation_path=f'{annotations_dir}/synthetic_test.jsonl',
    dimension_names=config_audioonly['data']['dimensions'],
    audio_sample_rate=24000,
    max_audio_length=240000,
    max_midi_events=0,  # Audio-only
    augmentation_config=None,
    apply_augmentation=False,
)

test_loader = DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=False,
    num_workers=2,
    collate_fn=collate_fn,
    pin_memory=True,
)

print(f"✓ Test dataloader created ({len(test_dataset)} samples)")

In [None]:
# Evaluate audio-only model
def evaluate_model(model, dataloader, device='cuda'):
    model.eval()
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for batch in dataloader:
            audio = batch['audio_waveform'].to(device)
            midi = batch.get('midi_tokens')
            if midi is not None:
                midi = midi.to(device)
            targets = batch['labels'].to(device)

            output = model(audio, midi)
            preds = output['scores']

            all_preds.append(preds.cpu().numpy())
            all_targets.append(targets.cpu().numpy())

    all_preds = np.concatenate(all_preds, axis=0)
    all_targets = np.concatenate(all_targets, axis=0)

    return all_preds, all_targets

print("Evaluating audio-only model on test set...")
preds_audioonly, targets = evaluate_model(model_audioonly, test_loader)

print(f"\n✓ Evaluation complete")
print(f"  Test samples: {len(targets)}")
print(f"  Dimensions: {len(config_audioonly['data']['dimensions'])}")

In [None]:
# Compute correlations for audio-only
dimension_names = config_audioonly['data']['dimensions']

results = []

for i, dim_name in enumerate(dimension_names):
    r_audioonly, p_audioonly = pearsonr(targets[:, i], preds_audioonly[:, i])
    mae_audioonly = np.abs(targets[:, i] - preds_audioonly[:, i]).mean()

    results.append({
        'dimension': dim_name,
        'audioonly_r': r_audioonly,
        'audioonly_mae': mae_audioonly,
    })

results_df = pd.DataFrame(results)

print("\n" + "="*80)
print("AUDIO-ONLY RESULTS - Test Set")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)

avg_r = results_df['audioonly_r'].mean()
print(f"\nSummary:")
print(f"  Mean Pearson r: {avg_r:.3f}")
print(f"  Mean MAE: {results_df['audioonly_mae'].mean():.2f}")

In [None]:
# Check success criteria
print(f"\n" + "="*80)
print("SUCCESS CRITERIA EVALUATION")
print("="*80)

print(f"\n1. Feasibility (Can model learn patterns?)")
print(f"   {'✓' if avg_r >= 0.2 else '✗'} Mean r = {avg_r:.3f} {'>= 0.2' if avg_r >= 0.2 else '< 0.2'}")
print(f"   Status: {'PASS - Model is learning!' if avg_r >= 0.2 else 'FAIL - Model not learning'}")

print(f"\n2. Architecture Validation (Good performance?)")
print(f"   {'✓' if avg_r >= 0.35 else '✗'} Mean r = {avg_r:.3f} {'>= 0.35' if avg_r >= 0.35 else '< 0.35'}")
print(f"   Status: {'PASS - Architecture validated!' if avg_r >= 0.35 else 'NOT MET'}")

print(f"\n3. MVP Target (Excellent performance?)")
print(f"   {'✓' if avg_r >= 0.48 else '✗'} Mean r = {avg_r:.3f} {'>= 0.48' if avg_r >= 0.48 else '< 0.48'}")
print(f"   Status: {'PASS - MVP TARGET HIT!' if avg_r >= 0.48 else 'NOT MET'}")

print(f"\n" + "="*80)

# Decision
if avg_r >= 0.48:
    print("\n🎉 EXCELLENT! Proceed with real label collection.")
    print("   Audio-only performance already at MVP level.")
    print("   Consider adding MIDI for an expected (not yet verified) 15-20% boost.")
elif avg_r >= 0.35:
    print("\n✓ GOOD! Architecture is validated.")
    print("  Synthetic labels work, but noisy.")
    print("  Proceed with real label collection.")
    print("  Consider training audio+MIDI model to compare.")
elif avg_r >= 0.2:
    print("\n⚠️ MARGINAL. Model learning but weak.")
    print("  Check synthetic label quality.")
    print("  Consider collecting small batch (~50) of real labels first.")
else:
    print("\n✗ FAILURE. Model not learning.")
    print("  DO NOT collect real labels yet.")
    print("  Debug: Check training logs, label distributions, model outputs.")

In [None]:
# Save results
results_path = f'{checkpoint_dir}/audioonly_results.csv'
results_df.to_csv(results_path, index=False)

print(f"\n✓ Results saved to: {results_path}")

## 7. Evaluation - Compare Audio+MIDI vs Audio-Only (Optional)

**Only run this section if:**
1. You trained the audio+MIDI model above
2. Audio-only results were promising (r >= 0.35)

This will quantify the multi-modal advantage.
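
The comparison in this section reports the advantage as a percent change in r per dimension, averaged over dimensions. The sketch below, using made-up numbers and hypothetical dimension names, shows how that average can differ from the percent change of the mean r: a dimension with a very small baseline r can dominate the averaged figure.

```python
def relative_gain(r_base, r_new):
    """Percent change of r_new over r_base, relative to |r_base|."""
    if r_base == 0:
        return 0.0  # undefined ratio; reported as zero here
    return (r_new - r_base) / abs(r_base) * 100.0

# Made-up correlations for two hypothetical dimensions
audio_only = {"dim_a": 0.40, "dim_b": 0.05}
audio_midi = {"dim_a": 0.44, "dim_b": 0.10}

# Option 1: average the per-dimension percent changes
per_dim = [relative_gain(audio_only[d], audio_midi[d]) for d in audio_only]
mean_of_gains = sum(per_dim) / len(per_dim)

# Option 2: percent change of the mean correlation
mean_base = sum(audio_only.values()) / len(audio_only)
mean_new = sum(audio_midi.values()) / len(audio_midi)
gain_of_means = relative_gain(mean_base, mean_new)

print(f"mean of per-dimension gains: {mean_of_gains:.1f}%")
print(f"gain of mean r:              {gain_of_means:.1f}%")
```

When the two figures diverge sharply, reporting both (or the absolute difference in mean r) gives a fairer picture than the averaged percentage alone.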

In [None]:
# Load audio+MIDI model
audioMIDI_ckpt = get_best_checkpoint(f'{checkpoint_dir}/baseline_synthetic')

print(f"Loading audio+MIDI model...")
print(f"  Checkpoint: {os.path.basename(audioMIDI_ckpt)}")

model_audioMIDI = PerformanceEvaluationModel.load_from_checkpoint(audioMIDI_ckpt)
model_audioMIDI.eval().cuda()

print(f"\n✓ Model loaded")

In [None]:
# Create test dataloader with MIDI
test_dataset_midi = PerformanceDataset(
    annotation_path=f'{annotations_dir}/synthetic_test.jsonl',
    dimension_names=config['data']['dimensions'],
    audio_sample_rate=24000,
    max_audio_length=240000,
    max_midi_events=512,
    augmentation_config=None,
    apply_augmentation=False,
)

test_loader_midi = DataLoader(
    test_dataset_midi,
    batch_size=8,
    shuffle=False,
    num_workers=2,
    collate_fn=collate_fn,
    pin_memory=True,
)

print(f"✓ Test dataloader with MIDI created")

In [None]:
# Evaluate audio+MIDI model
print("Evaluating audio+MIDI model on test set...")
preds_audioMIDI, targets_midi = evaluate_model(model_audioMIDI, test_loader_midi)

print(f"\n✓ Evaluation complete")

In [None]:
# Compare audio+MIDI vs audio-only
comparison_results = []

for i, dim_name in enumerate(dimension_names):
    # Audio+MIDI correlations
    r_audioMIDI, _ = pearsonr(targets_midi[:, i], preds_audioMIDI[:, i])
    mae_audioMIDI = np.abs(targets_midi[:, i] - preds_audioMIDI[:, i]).mean()

    # Audio-only metrics (reused from results_df computed in Section 6, not recomputed)
    r_audioonly = results_df.loc[results_df['dimension'] == dim_name, 'audioonly_r'].values[0]
    mae_audioonly = results_df.loc[results_df['dimension'] == dim_name, 'audioonly_mae'].values[0]

    # Multi-modal advantage: percent change in r relative to |audio-only r|
    # (abs() keeps the sign meaningful when the audio-only r is negative)
    advantage = ((r_audioMIDI - r_audioonly) / abs(r_audioonly) * 100) if r_audioonly != 0 else 0

    comparison_results.append({
        'dimension': dim_name,
        'audioMIDI_r': r_audioMIDI,
        'audioMIDI_mae': mae_audioMIDI,
        'audioonly_r': r_audioonly,
        'audioonly_mae': mae_audioonly,
        'multimodal_advantage': advantage,
    })

comparison_df = pd.DataFrame(comparison_results)

print("\n" + "="*80)
print("COMPARISON: Audio+MIDI vs Audio-Only")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

# Summary
print(f"\nSummary:")
print(f"  Audio+MIDI mean r: {comparison_df['audioMIDI_r'].mean():.3f}")
print(f"  Audio-only mean r: {comparison_df['audioonly_r'].mean():.3f}")
print(f"  Multi-modal advantage: {comparison_df['multimodal_advantage'].mean():.1f}%")

multimodal_gain = comparison_df['multimodal_advantage'].mean()

print(f"\n4. Multi-modal Advantage")
print(f"   {'✓' if multimodal_gain >= 10 else '✗'} Advantage = {multimodal_gain:.1f}% {'>= 10%' if multimodal_gain >= 10 else '< 10%'}")
print(f"   Status: {'PASS - MIDI helps!' if multimodal_gain >= 10 else 'NOT MET - MIDI not helping much'}")

print(f"\n" + "="*80)

In [None]:
# Save comparison results
comparison_path = f'{checkpoint_dir}/comparison_results.csv'
comparison_df.to_csv(comparison_path, index=False)

print(f"\n✓ Comparison results saved to: {comparison_path}")

## Summary

Check the results above to determine next steps:

**If audio-only r >= 0.48:**
- ✓ MVP target hit with audio alone!
- → Proceed with real label collection
- → MIDI is expected to add a further 15-20% boost (verify in Section 7)

**If audio-only r >= 0.35:**
- ✓ Architecture validated
- → Collect real labels (200-300 segments)
- → Expect significant improvement

**If audio-only r >= 0.20:**
- ⚠️ Marginal learning
- → Collect small batch (50) of real labels first
- → Test if real labels improve performance

**If audio-only r < 0.20:**
- ✗ Model not learning
- → Debug before collecting real labels
- → Check training logs and label quality

All results are saved to Google Drive for future reference.
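
The decision tiers above can also be expressed as a small function, which makes the thresholds easy to reuse when comparing runs. This is a minimal sketch: the function name `decision_tier` and the tier labels are hypothetical, while the cutoffs (0.48, 0.35, 0.20) are the ones used in this notebook.

```python
def decision_tier(mean_r):
    """Map a mean Pearson r to one of the notebook's four outcome tiers."""
    thresholds = [
        (0.48, "mvp"),         # MVP target hit
        (0.35, "validated"),   # architecture validated
        (0.20, "marginal"),    # learning, but weak
    ]
    for cutoff, label in thresholds:
        if mean_r >= cutoff:
            return label
    return "not_learning"      # debug before collecting real labels

for r in (0.52, 0.40, 0.25, 0.10):
    print(f"mean r = {r:.2f} -> {decision_tier(r)}")
```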