# Generalized Bioacoustic LLM Evaluation

**Evaluation Framework for Open-Weights Audio LLMs on Bioacoustic Captioning**

---

## Supported Models
| Model | VRAM | Status |
|-------|------|--------|
| **Qwen2-Audio-7B** | ~14GB | ✅ Working |
| **NatureLM-audio** | ~10GB | ✅ Working |
| **SALMONN** | ~29GB | ⏳ Future Work |
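
As a rough sketch, the VRAM figures in the table can be turned into a check for which models fit on a given GPU. The dictionary keys and the `headroom_gb` margin are illustrative assumptions, not values taken from the repository.

```python
# Approximate VRAM needs in GB, copied from the table above.
MODEL_VRAM_GB = {"qwen": 14, "naturelm": 10, "salmonn": 29}


def models_that_fit(total_vram_gb, headroom_gb=1.0):
    """Return model keys whose approximate VRAM need fits with some headroom."""
    return sorted(
        name
        for name, need in MODEL_VRAM_GB.items()
        if need + headroom_gb <= total_vram_gb
    )


print(models_that_fit(15.8))  # a nominal 16 GB card
print(models_that_fit(40.0))  # a 40 GB card
```

Fitting in memory is only a necessary condition: SALMONN is still listed as future work regardless of available VRAM.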

## Evaluation Configurations
- **Prompt Roles**: baseline, ornithologist, skeptical, multi-taxa
- **Shot Configs**: 0-shot, 3-shot, 5-shot
- **Total**: 2 models × 4 prompts × 3 shots = **24 configurations**
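
The configuration count above is a plain Cartesian product, which can be sketched with `itertools.product`. The model keys `qwen` and `naturelm` match the ones used in Step 6; the tuple layout is only an illustration.

```python
from itertools import product

models = ["qwen", "naturelm"]
prompts = ["baseline", "ornithologist", "skeptical", "multi-taxa"]
shots = [0, 3, 5]

# Every (model, prompt, shots) combination is one evaluation configuration.
configs = list(product(models, prompts, shots))

print(len(configs))
print(configs[0])
```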

## Dataset
- **AnimalSpeak SPIDEr Benchmark**: 500 samples with human-written captions
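
The benchmark ships as a JSONL file (see `BENCHMARK_PATH` in Step 6). A minimal sketch of reading such a file follows; `load_jsonl` is a hypothetical helper, not the repository's loader, and it makes no assumption about the field names inside each record.

```python
import json


def load_jsonl(path, max_samples=None):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
            # Stop early when a sample cap is set (mirrors MAX_SAMPLES in Step 6).
            if max_samples is not None and len(records) >= max_samples:
                break
    return records
```

Calling `load_jsonl(BENCHMARK_PATH, max_samples=10)` would return the first ten records as dictionaries, which is handy for inspecting the available keys before a full run.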

---

## Quick Start
1. Set runtime to **A100 GPU** (Runtime → Change runtime type)
2. Run the cells in order, restarting the runtime after Step 2 when prompted
3. Results are saved to Google Drive

---

## Step 1: Check GPU & Mount Drive

In [None]:
import torch
import os

# Check GPU
print("=" * 60)
print("GPU CHECK")
print("=" * 60)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    total_vram = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"VRAM: {total_vram:.1f} GB")
    
    # Thresholds are in decimal GB; a nominal 16 GB card reports slightly less than 16.
    if total_vram >= 40:
        print("✓ A100-class GPU (40 GB+) - All supported models fit")
    elif total_vram >= 15:
        print("⚠ T4/L4-class GPU - Qwen and NatureLM only")
    else:
        print("❌ Insufficient VRAM - Upgrade to A100")
else:
    print("❌ No GPU detected!")
    print("Go to: Runtime → Change runtime type → A100 GPU")

# Mount Google Drive
print("\n" + "=" * 60)
print("MOUNTING GOOGLE DRIVE")
print("=" * 60)

from google.colab import drive
drive.mount('/content/drive')

# Create output directory
OUTPUT_DIR = '/content/drive/MyDrive/AcousticLLMeval_Results'
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"\n✓ Output directory: {OUTPUT_DIR}")

## Step 2: Install Dependencies

This installs all required packages. Takes ~3-5 minutes.

In [None]:
%%bash
echo "Installing dependencies..."

# Upgrade pip
pip install --upgrade pip -q

# Core ML packages
pip install torch torchvision torchaudio -q
pip install transformers accelerate -q

# Audio processing
pip install librosa soundfile requests -q

# CRITICAL: Upgrade Pillow to fix '_Ink' import error
pip install --upgrade 'Pillow>=10.0.0' -q

# NatureLM dependencies
pip install peft einops omegaconf cloudpathlib -q
pip install google-cloud-storage tensorboardx wandb timm -q
pip install pydantic-settings pydub resampy -q
pip install pandas mir-eval levenshtein memoization plumbum tensorboard -q

echo ""
echo "============================================================"
echo "✓ All dependencies installed!"
echo ""
echo "⚠️  IMPORTANT: You MUST restart the runtime now!"
echo "   Go to: Runtime → Restart runtime"
echo "   Then re-run Step 1 (restores OUTPUT_DIR) and continue from Step 3 (skip Step 2)"
echo "============================================================"

## Step 3: Clone Repository & Setup NatureLM

In [None]:
import os
import sys

# Clone evaluation repository
if not os.path.exists('/content/AcousticLLMevalGeneralized'):
    print("Cloning AcousticLLMevalGeneralized...")
    !git clone https://github.com/Ray149s/AcousticLLMevalGeneralized.git /content/AcousticLLMevalGeneralized
else:
    print("✓ Repository already exists")
    !cd /content/AcousticLLMevalGeneralized && git pull

# Clone NatureLM
if not os.path.exists('/content/NatureLM-audio'):
    print("\nCloning NatureLM-audio...")
    !git clone https://github.com/earthspecies/NatureLM-audio.git /content/NatureLM-audio
else:
    print("✓ NatureLM-audio already exists")

# CRITICAL FIX: Patch NatureLM for newer transformers (>=4.40)
# Multiple functions moved from modeling_utils to pytorch_utils
print("\nPatching NatureLM for transformers compatibility...")
file_path = '/content/NatureLM-audio/NatureLM/models/Qformer.py'

with open(file_path, 'r') as f:
    content = f.read()

# Check if already patched
if 'from transformers.pytorch_utils import' not in content:
    old_import = "from transformers.modeling_utils import ("
    new_import = """from transformers.pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer
from transformers.modeling_utils import ("""
    
    content = content.replace(old_import, new_import)
    content = content.replace("    apply_chunking_to_forward,\n", "")
    content = content.replace("    find_pruneable_heads_and_indices,\n", "")
    content = content.replace("    prune_linear_layer,\n", "")
    
    with open(file_path, 'w') as f:
        f.write(content)
    print("✓ NatureLM patched")
else:
    print("✓ NatureLM already patched")

# Install NatureLM (no-deps to avoid conflicts)
print("\nInstalling NatureLM...")
!pip install --no-deps -e /content/NatureLM-audio -q

# Add to Python path
sys.path.insert(0, '/content/AcousticLLMevalGeneralized')
sys.path.insert(0, '/content/NatureLM-audio')

print("\n✓ Setup complete!")
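
The patch above is a string rewrite that is safe to run more than once. The same idea can be sketched as a pure function and tried on a small sample; the `sample` text below is a made-up stand-in for the import block in `Qformer.py`, not a copy of it.

```python
MOVED = (
    "apply_chunking_to_forward",
    "find_pruneable_heads_and_indices",
    "prune_linear_layer",
)
MARKER = "from transformers.pytorch_utils import"
OLD = "from transformers.modeling_utils import ("


def patch_imports(source):
    """Move the three helper imports to transformers.pytorch_utils (idempotent)."""
    if MARKER in source:
        return source  # already patched, leave untouched
    new_line = f"{MARKER} {', '.join(MOVED)}\n"
    patched = source.replace(OLD, new_line + OLD, 1)
    for name in MOVED:
        # Drop the old entry from the parenthesized modeling_utils import.
        patched = patched.replace(f"    {name},\n", "", 1)
    return patched


sample = (
    "from transformers.modeling_utils import (\n"
    "    PreTrainedModel,\n"
    "    apply_chunking_to_forward,\n"
    "    find_pruneable_heads_and_indices,\n"
    "    prune_linear_layer,\n"
    ")\n"
)
once = patch_imports(sample)
print(once)
```

Running `patch_imports` on its own output returns the text unchanged, which is the property the "already patched" check in the cell relies on.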

## Step 4: HuggingFace Authentication

Required for NatureLM (needs Llama-3.1 access).

**Get your token:** https://huggingface.co/settings/tokens

**Request access to:** https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

In [None]:
from huggingface_hub import login
import os

# Try Colab secrets first
try:
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
    print("✓ Using HF_TOKEN from Colab secrets")
except Exception:
    # Secret missing, access not granted, or not running in Colab: fall back to manual entry
    from getpass import getpass
    print("Enter your HuggingFace token:")
    HF_TOKEN = getpass()

# Login
login(token=HF_TOKEN, add_to_git_credential=False)
os.environ['HF_TOKEN'] = HF_TOKEN

print("✓ Authenticated with HuggingFace")

## Step 5: Verify Model Imports

Test that both models can be imported before running evaluation.

In [None]:
print("Testing model imports...\n")

failed = []  # names of import groups that did not load

# Test Qwen2Audio
try:
    from transformers import Qwen2AudioForConditionalGeneration, Qwen2AudioProcessor
    print("✓ Qwen2Audio imports OK")
except ImportError as e:
    failed.append("Qwen2Audio")
    print(f"❌ Qwen2Audio import failed: {e}")

# Test NatureLM
try:
    from NatureLM.models import NatureLM
    from NatureLM.infer import Pipeline
    print("✓ NatureLM imports OK")
except ImportError as e:
    failed.append("NatureLM")
    print(f"❌ NatureLM import failed: {e}")

# Test prompt config
try:
    from prompt_config import build_prompt, PROMPT_VERSIONS, SHOT_CONFIGS
    print("✓ Prompt config imports OK")
    print(f"  Prompt versions: {PROMPT_VERSIONS}")
    print(f"  Shot configs: {SHOT_CONFIGS}")
except ImportError as e:
    failed.append("prompt_config")
    print(f"❌ Prompt config import failed: {e}")

# Report success only when every import group loaded
if failed:
    print(f"\n❌ Imports failed for: {', '.join(failed)} (see Troubleshooting)")
else:
    print("\n✓ All imports successful!")

## Step 6: Configuration

Configure which models, prompts, and shots to evaluate.

In [None]:
# ============================================================
# EVALUATION CONFIGURATION
# ============================================================

# Models to evaluate
MODELS = ['qwen', 'naturelm']  # Options: 'qwen', 'naturelm'

# Prompt versions (from Gemini evaluation)
PROMPT_VERSIONS = ['baseline', 'ornithologist', 'skeptical', 'multi-taxa']

# Shot configurations
SHOT_CONFIGS = [0, 3, 5]

# Dataset
BENCHMARK_PATH = '/content/AcousticLLMevalGeneralized/animalspeak_spider_benchmark.jsonl'

# Samples (None = all 500, or set integer for testing)
MAX_SAMPLES = None  # Set to 10 for quick test

# ============================================================

# Calculate total configurations
total_configs = len(MODELS) * len(PROMPT_VERSIONS) * len(SHOT_CONFIGS)
samples = MAX_SAMPLES if MAX_SAMPLES is not None else 500

print("=" * 60)
print("EVALUATION CONFIGURATION")
print("=" * 60)
print(f"Models: {MODELS}")
print(f"Prompts: {PROMPT_VERSIONS}")
print(f"Shots: {SHOT_CONFIGS}")
print(f"Samples: {samples}")
print(f"Total configs: {total_configs}")
print(f"Total evaluations: {total_configs * samples:,}")
print("=" * 60)
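
Before launching a long run, it can help to sanity-check the configuration values. The sketch below is a hypothetical helper (not part of the repository) that checks the lists against the options named in this notebook and the 500-sample benchmark size.

```python
ALLOWED_MODELS = {"qwen", "naturelm"}
ALLOWED_PROMPTS = {"baseline", "ornithologist", "skeptical", "multi-taxa"}
ALLOWED_SHOTS = {0, 3, 5}


def validate_config(models, prompt_versions, shot_configs, max_samples=None):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    for label, chosen, allowed in (
        ("model", models, ALLOWED_MODELS),
        ("prompt version", prompt_versions, ALLOWED_PROMPTS),
        ("shot config", shot_configs, ALLOWED_SHOTS),
    ):
        if not chosen:
            problems.append(f"no {label} selected")
        for item in chosen:
            if item not in allowed:
                problems.append(f"unknown {label}: {item!r}")
    # None means "use all samples"; otherwise expect 1..500.
    if max_samples is not None and not (1 <= max_samples <= 500):
        problems.append(f"max_samples out of range: {max_samples}")
    return problems
```

For example, `validate_config(MODELS, PROMPT_VERSIONS, SHOT_CONFIGS, MAX_SAMPLES)` returns an empty list for the defaults in the cell above.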

## Step 7: Run Evaluation

This runs the full evaluation across all configurations.

In [None]:
import os
os.chdir('/content/AcousticLLMevalGeneralized')

from run_full_evaluation import run_full_evaluation

# Run evaluation
results = run_full_evaluation(
    models=MODELS,
    prompt_versions=PROMPT_VERSIONS,
    shot_configs=SHOT_CONFIGS,
    jsonl_path=BENCHMARK_PATH,
    max_samples=MAX_SAMPLES,
    output_dir=OUTPUT_DIR,
)

print("\n" + "=" * 60)
print("EVALUATION COMPLETE")
print("=" * 60)
print(f"Results saved to: {OUTPUT_DIR}")
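
A full run covers many configurations, so a Colab disconnect can be costly. One way to resume is to skip configurations whose result file already exists. The sketch below assumes a hypothetical file naming scheme (`<model>_<prompt>_<shots>shot_results.json`); the names actually written by `run_full_evaluation`, and whether it already resumes on its own, should be checked against the repository.

```python
from itertools import product
from pathlib import Path


def pending_configs(output_dir, models, prompt_versions, shot_configs):
    """Yield (model, prompt, shots) tuples that have no result file yet.

    Assumes the hypothetical naming scheme described in the text above.
    """
    out = Path(output_dir)
    for model, prompt, shots in product(models, prompt_versions, shot_configs):
        result_file = out / f"{model}_{prompt}_{shots}shot_results.json"
        if not result_file.exists():
            yield model, prompt, shots
```

The remaining tuples could then be passed to the evaluation one at a time, for example with `models=[model]`, `prompt_versions=[prompt]`, and `shot_configs=[shots]`.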

## Step 8: View Sample Results

Preview some example predictions from both models.

In [None]:
import json
from pathlib import Path

# Find result files
result_files = list(Path(OUTPUT_DIR).glob('*_results.json'))

print(f"Found {len(result_files)} result files\n")

# Show sample from each model
for model in MODELS:
    # Sort so the file shown is deterministic (glob order is not guaranteed)
    model_files = sorted(f for f in result_files if model in f.name)
    if model_files:
        # Load the first config in sorted order
        with open(model_files[0]) as f:
            data = json.load(f)
        
        print("=" * 60)
        print(f"MODEL: {model.upper()}")
        print(f"Config: {data.get('config_name', 'unknown')}")
        print("=" * 60)
        
        # Show first 3 samples
        for i, result in enumerate(data.get('results', [])[:3]):
            # Guard against a missing or None prediction/latency
            prediction = result.get('prediction') or 'N/A'
            print(f"\nSample {i+1}: {result.get('species', 'Unknown')}")
            print(f"Reference: {result.get('reference', 'N/A')}")
            print(f"Prediction: {prediction[:100]}...")
            print(f"Latency: {result.get('latency') or 0:.2f}s")
        
        print()

## Step 9: Generate Summary Table

In [None]:
import json
from pathlib import Path
import pandas as pd

# Collect all results
summary_data = []

for result_file in Path(OUTPUT_DIR).glob('*_results.json'):
    with open(result_file) as f:
        data = json.load(f)
    
    model = data.get('model', result_file.stem.split('_')[0])
    prompt = data.get('prompt_version', 'unknown')
    shots = data.get('n_shots', 0)
    
    summary_data.append({
        'Model': model,
        'Prompt': prompt,
        'Shots': shots,
        'Samples': data.get('samples_tested', 0),
        'Success': data.get('successful', 0),
        'Avg Latency': f"{data.get('avg_latency', 0):.2f}s",
    })

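# Create DataFrame (explicit columns keep the sort valid even when no result files exist)
columns = ['Model', 'Prompt', 'Shots', 'Samples', 'Success', 'Avg Latency']
df = pd.DataFrame(summary_data, columns=columns)
df = df.sort_values(['Model', 'Prompt', 'Shots'])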

print("=" * 80)
print("EVALUATION SUMMARY")
print("=" * 80)
print(df.to_string(index=False))

# Save summary
summary_path = f"{OUTPUT_DIR}/evaluation_summary.csv"
df.to_csv(summary_path, index=False)
print(f"\n✓ Summary saved to: {summary_path}")
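
The summary reports raw counts; a success rate per configuration is often easier to compare. A minimal sketch using the same `samples_tested` and `successful` fields read above follows. The numbers in `example` are made up for illustration.

```python
def success_rate(result):
    """Fraction of successful samples, or None when no samples were tested."""
    tested = result.get("samples_tested", 0)
    if not tested:
        return None
    return result.get("successful", 0) / tested


example = {"model": "qwen", "samples_tested": 500, "successful": 490}  # made-up numbers
print(f"{success_rate(example):.1%}")
```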

## Step 10: Download Results

Download all results as a ZIP file.

In [None]:
import zipfile
from datetime import datetime
from pathlib import Path  # imported here so the cell also runs on its own

from google.colab import files

# Create ZIP
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
zip_path = f'/content/evaluation_results_{timestamp}.zip'

print("Creating ZIP archive...")
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for result_file in Path(OUTPUT_DIR).glob('*.json'):
        zipf.write(result_file, result_file.name)
    for csv_file in Path(OUTPUT_DIR).glob('*.csv'):
        zipf.write(csv_file, csv_file.name)

print(f"✓ ZIP created: {zip_path}")
print("\nDownloading...")
files.download(zip_path)

print("\n✓ Download complete!")
print(f"Results also saved in Google Drive: {OUTPUT_DIR}")

---

## Appendix: Example Prompt (3-shot baseline)

This shows the prompt that is sent to the models:

In [None]:
from prompt_config import build_prompt, FewShotExample

# Example few-shot examples
examples = [
    FewShotExample(
        caption='a chorus of squirrel treefrogs with southern leopard frogs.',
        environment='iNaturalist'
    ),
    FewShotExample(
        caption='faint song of an alder flycatcher.',
        environment='iNaturalist'
    ),
    FewShotExample(
        caption='an american woodcock calls with a series of peents at dusk.',
        environment='iNaturalist'
    ),
]

# Build prompt
prompt = build_prompt(prompt_version='baseline', examples=examples)

print("=" * 60)
print("EXAMPLE PROMPT (baseline, 3-shot)")
print("=" * 60)
print(prompt)
print("=" * 60)
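
For readers without the repository at hand, the general shape of a few-shot captioning prompt can be sketched as follows. This is not the implementation of `build_prompt`; the function name, the instruction text, and the `Example N:` layout are all assumptions made for illustration.

```python
def sketch_few_shot_prompt(instruction, captions):
    """Join an instruction and example captions into one prompt string."""
    lines = [instruction, ""]
    for i, caption in enumerate(captions, start=1):
        lines.append(f"Example {i}: {caption}")
    if captions:
        lines.append("")  # blank line between the examples and the final cue
    lines.append("Caption:")
    return "\n".join(lines)


demo = sketch_few_shot_prompt(
    "Describe the animal sounds in this recording.",  # hypothetical instruction
    [
        "a chorus of squirrel treefrogs with southern leopard frogs.",
        "faint song of an alder flycatcher.",
    ],
)
print(demo)
```

With an empty caption list the same function yields a 0-shot prompt, which is how the shot count can vary independently of the prompt role.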

## Appendix: Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| `cannot import name '_Ink' from 'PIL._typing'` | Run: `pip install --upgrade 'Pillow>=10.0.0'` then **restart runtime** |
| `Qwen2AudioForConditionalGeneration` not found | Same as above - Pillow version issue |
| `No module named 'peft'` | Run: `pip install peft` |
| `No module named 'NatureLM'` | Run: `pip install --no-deps -e /content/NatureLM-audio` |
| `No module named 'run_full_evaluation'` | Run: `!cd /content/AcousticLLMevalGeneralized && git pull` |
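| Out of memory | Evaluate one model at a time (e.g. `MODELS = ['qwen']`) or switch to an A100 runtime; `MAX_SAMPLES = 10` shortens a test run but does not reduce the memory needed to load a model |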
| HuggingFace auth failed | Check token and Llama-3.1 access |

### After Step 2: MUST Restart Runtime
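The Pillow upgrade requires a runtime restart to take effect. The restart also clears Python variables such as `OUTPUT_DIR`, so Step 1 must be run again:
1. Run Step 2 (Install Dependencies)
2. **Runtime → Restart runtime**
3. Re-run Step 1, then continue from Step 3 (skip Step 2)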

### Quick Fix for Pillow Error
```python
!pip install --upgrade 'Pillow>=10.0.0'
# Then: Runtime → Restart runtime
```
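
To confirm that the installed Pillow meets the minimum version, a small check along these lines can be run after the restart. `parse_version` is a hypothetical helper that reads only the first three numeric components of the version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a version string like '10.2.0' into a 3-tuple of ints, padding with zeros."""
    nums = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))


try:
    installed = version("Pillow")
    ok = parse_version(installed) >= (10, 0, 0)
    print(f"Pillow {installed}: {'OK' if ok else 'too old, upgrade and restart'}")
except PackageNotFoundError:
    print("Pillow is not installed")
```

This reads the installed package metadata. In a runtime that has not been restarted, the already-imported module may still be the old one, which `PIL.__version__` would reveal.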