# Multi-Model SPIDEr Score Computation (Google Colab)

This notebook computes SPIDEr scores (the arithmetic mean of CIDEr and SPICE) for bioacoustic captioning models. It also reports BLEU-4 and ROUGE-L for each configuration.
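The arithmetic behind the headline metric is simple enough to check by hand. A minimal sketch, using the illustrative values from the output example at the end of this notebook (the helper name `spider` is hypothetical and not part of the notebook's code):

```python
def spider(cider: float, spice: float) -> float:
    """Return SPIDEr, the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2


# Illustrative values only, taken from the sample output shown in the notes.
score = spider(cider=0.0456, spice=0.0990)
print(round(score, 4))
```

Because the two metrics are averaged with equal weight, a SPICE score of zero halves the result rather than leaving it undefined.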

## Supported Models
- **Qwen2-Audio** - Alibaba's audio-language model
- **NatureLM** - Earth Species Project model (auto-removes timestamps)
- **Gemini Flash/Pro** - Google's multimodal models
- **Any future model** conforming to the output format

## Instructions
1. Upload this notebook to Google Colab
2. Upload your `*_results.json` files when prompted
3. Run all cells
4. Download the generated SPIDEr score files
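Judging from the fields read in Step 5, each uploaded file appears to need a top-level `results` list whose entries carry a `success` flag, a `reference` (or `ground_truth`) caption, and a `prediction` (or `response`) caption. The sketch below builds such a structure and applies the same success filter; the species and caption strings are made-up placeholders.

```python
import json

# Placeholder content; field names follow what Step 5 reads from each file.
sample = {
    'results': [
        {
            'success': True,
            'species': 'American Woodcock',
            'reference': 'An American Woodcock calls repeatedly.',
            'prediction': 'American Woodcock calling.',
        },
        {
            'success': False,  # failed samples are skipped when scoring
            'species': 'American Robin',
            'reference': 'An American Robin sings.',
            'prediction': '',
        },
    ]
}

# Round-trip through JSON to mimic saving and re-loading a results file.
loaded = json.loads(json.dumps(sample))

# Only entries flagged as successful are scored.
scorable = [r for r in loaded['results'] if r.get('success', False)]
print(f"{len(scorable)} of {len(loaded['results'])} samples would be scored")
```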

## Expected File Format
Files should be named: `{model}_{prompt}_{shots}shot_results.json`

Examples: `qwen_baseline_3shot_results.json`, `naturelm_ornithologist_5shot_results.json`
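File names can be checked against this convention before uploading. The sketch below uses a stricter regular expression than the `parse_filename` helper in Step 5. The pattern and the name `split_result_filename` are assumptions: the first underscore-delimited token is treated as the model, and everything up to the shot count as the prompt.

```python
import re

# Assumed pattern: {model}_{prompt}_{shots}shot_results.json
# The model is the first token; the prompt may itself contain underscores.
FILENAME_PATTERN = re.compile(
    r'^(?P<model>[^_]+)_(?P<prompt>.+)_(?P<shots>\d+)shot_results\.json$'
)


def split_result_filename(filename: str):
    """Return model, prompt, and shots, or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return {
        'model': match['model'],
        'prompt': match['prompt'],
        'shots': int(match['shots']),
    }


print(split_result_filename('qwen_baseline_3shot_results.json'))
print(split_result_filename('naturelm_ornithologist_5shot_results.json'))
print(split_result_filename('notes.json'))  # does not follow the convention
```

Returning `None` for a non-matching name makes it easy to flag stray files instead of silently assigning them default values.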

In [None]:
# Step 1: Mount Google Drive (optional - for saving results)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Step 2: Install Required Packages
!pip install pycocoevalcap typing-extensions -q
print("\n" + "="*60)
print("PACKAGES INSTALLED SUCCESSFULLY")
print("="*60)

In [None]:
# Step 3: Configure Java Environment for SPICE
import os
import subprocess

print("Configuring Java environment for SPICE...")

# Update package list and install Java 11
!apt-get update -qq
!apt-get install -y -qq openjdk-11-jdk-headless > /dev/null

# Force system to use Java 11
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java 2>/dev/null || true

# Configure Java Environment Variables for SPICE
os.environ['_JAVA_OPTIONS'] = (
    '-Xmx8g '
    '--add-opens=java.base/java.lang=ALL-UNNAMED '
    '--add-opens=java.base/java.math=ALL-UNNAMED '
    '--add-opens=java.base/java.util=ALL-UNNAMED '
    '--add-opens=java.base/java.util.concurrent=ALL-UNNAMED '
    '--add-opens=java.base/java.net=ALL-UNNAMED '
    '--add-opens=java.base/java.text=ALL-UNNAMED '
    '--add-opens=java.base/java.lang.reflect=ALL-UNNAMED '
    '--add-opens=java.base/java.io=ALL-UNNAMED'
)

# Verify Installation
result = subprocess.run(['java', '-version'], capture_output=True, text=True)
if 'openjdk version "11' in result.stderr:
    print("\n" + "="*60)
    print("JAVA 11 ACTIVATED SUCCESSFULLY")
    print("="*60)
else:
    print("WARNING: Java 11 activation might have failed")
    print(result.stderr.splitlines()[0] if result.stderr else "No version info")

# Ensure SPICE cache directory exists inside the installed pycocoevalcap package
# (derived from the package location rather than a hard-coded dist-packages path)
import pycocoevalcap
spice_cache = os.path.join(os.path.dirname(pycocoevalcap.__file__), 'spice', 'cache')
os.makedirs(spice_cache, exist_ok=True)
print(f"SPICE cache ready: {spice_cache}")

In [None]:
# Step 4: Upload Evaluation Result Files
from google.colab import files
import os
import json
import re

# Create results directory
os.makedirs('results', exist_ok=True)

print("="*60)
print("UPLOAD EVALUATION RESULT FILES")
print("="*60)
print("\nUpload all *_results.json files from your evaluation runs.")
print("Supported models: qwen, naturelm, flash, pro, salmonn, etc.")
print("\nExpected naming: {model}_{prompt}_{shots}shot_results.json")
print("Example: qwen_baseline_3shot_results.json\n")

uploaded = files.upload()

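for filename, content in uploaded.items():
    dest_path = os.path.join('results', filename)
    if os.path.exists(filename):
        # Colab writes uploads to the working directory; move them into results/
        os.replace(filename, dest_path)
    else:
        # Fall back to the in-memory bytes returned by files.upload()
        with open(dest_path, 'wb') as f:
            f.write(content)
    print(f"  Saved: {dest_path}")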

print(f"\n" + "="*60)
print(f"UPLOADED {len(uploaded)} FILES")
print("="*60)

In [None]:
# Step 5: Define Core Functions
from pathlib import Path
from datetime import datetime
import json
import re
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge

# Known models that need preprocessing
MODELS_WITH_TIMESTAMPS = ['naturelm']

def parse_filename(filename: str) -> dict:
    """
    Parse model, prompt, and shots from filename.
    Handles: model_prompt_Nshot_results.json
    """
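    # Remove the _results.json suffix (only when it is actually a suffix)
    base = filename
    if base.endswith('_results.json'):
        base = base[:-len('_results.json')]

    # Try to extract shots (e.g., 0shot, 3shot, 5shot)
    shot_match = re.search(r'(\d+)shot', base)
    if shot_match:
        shots = int(shot_match.group(1))
        # Cut out the matched text itself so zero-padded counts (e.g., 03shot) are removed too
        base = base[:shot_match.start()] + base[shot_match.end():]
        base = re.sub(r'_+', '_', base).strip('_')
    else:
        shots = 0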
    
    # Split remaining into model and prompt
    parts = base.split('_')
    if len(parts) >= 2:
        model = parts[0]
        prompt = '_'.join(parts[1:])
    else:
        model = parts[0] if parts else 'unknown'
        prompt = 'baseline'
    
    return {'model': model, 'prompt': prompt, 'shots': shots}


def preprocess_naturelm_text(text: str) -> str:
    """
    Remove timestamp annotations from NatureLM output.
    
    NatureLM outputs like:
    "American Woodcock calling...\n#10.00s - 20.00s#: American Woodcock\n#20.00s - 30.00s#: ..."
    
    This function extracts only the first caption before timestamps.
    """
    if not text or not isinstance(text, str):
        return text
    
    # Split on newline and take first non-empty line
    lines = text.strip().split('\n')
    
    # Get first line (main caption)
    first_line = lines[0].strip()
    
    # Remove any leading timestamp like "#0.00s - 10.00s#: "
    timestamp_pattern = r'^#[\d.]+s\s*-\s*[\d.]+s#:\s*'
    first_line = re.sub(timestamp_pattern, '', first_line)
    
    return first_line.strip()


def verify_preprocessing(results_dir: Path):
    """
    Find a NatureLM sample with timestamps and show before/after preprocessing.
    This helps verify the preprocessing is working correctly.
    """
    print("\n" + "="*60)
    print("PREPROCESSING VERIFICATION")
    print("="*60)

    # Look for NatureLM files
    naturelm_files = list(results_dir.glob('naturelm_*_results.json'))

    if not naturelm_files:
        print("\nNo NatureLM files found - preprocessing verification skipped.")
        print("(This is fine if you're only evaluating Qwen/Gemini models)")
        return

    # Load first NatureLM file
    with open(naturelm_files[0], 'r') as f:
        data = json.load(f)

    results = data.get('results', [])

    # Find a sample with timestamps (contains newlines and #)
    sample_with_timestamps = None
    for r in results:
        prediction = r.get('prediction', r.get('response', ''))
        if prediction and '\n' in prediction and '#' in prediction:
            sample_with_timestamps = r
            break

    if not sample_with_timestamps:
        # Fall back to any sample
        sample_with_timestamps = results[0] if results else None

    if sample_with_timestamps:
        original = sample_with_timestamps.get('prediction', sample_with_timestamps.get('response', ''))
        processed = preprocess_naturelm_text(original)

        print(f"\nFile: {naturelm_files[0].name}")
        print(f"Species: {sample_with_timestamps.get('species', 'Unknown')}")

        print("\n" + "-"*60)
        print("BEFORE preprocessing:")
        print("-"*60)
        # Show first 500 chars to avoid overwhelming output
        display_text = original[:500] + "..." if len(original) > 500 else original
        print(repr(display_text))

        print("\n" + "-"*60)
        print("AFTER preprocessing:")
        print("-"*60)
        print(repr(processed))

        # Show what was removed
        if original != processed:
            print("\n" + "-"*60)
            print("VERIFICATION: Text changed by preprocessing (compare BEFORE and AFTER above)")
            print(f"  Original length: {len(original)} chars")
            print(f"  Processed length: {len(processed)} chars")
            print(f"  Removed: {len(original) - len(processed)} chars")
            print("-"*60)
        else:
            print("\n(No timestamps found in this sample - text unchanged)")
    else:
        print("\nNo samples found in NatureLM file for verification.")


def compute_spider_scores(result_data: dict, model_name: str = None):
    """
    Compute CIDEr, SPICE, SPIDEr, BLEU-4, and ROUGE-L scores.
    
    Args:
        result_data: JSON data from evaluation result file
        model_name: Model name for preprocessing decisions
    
    Returns:
        Dictionary with all computed metrics
    """
    if not result_data:
        return None

    results = result_data.get('results', [])
    successful_results = [r for r in results if r.get('success', False)]

    if not successful_results:
        return {
            'total': len(results),
            'successful': 0,
            'cider': 0.0,
            'spice': 0.0,
            'spider': 0.0,
            'bleu_4': 0.0,
            'rouge_l': 0.0
        }

    # Determine if preprocessing is needed
    needs_preprocessing = model_name and model_name.lower() in MODELS_WITH_TIMESTAMPS

    # Prepare data in pycocoevalcap format
    gts = {}  # ground truths (references)
    res = {}  # results (predictions)

    for idx, r in enumerate(successful_results):
        sample_id = str(idx)
        
        # Get reference (ground truth)
        reference = r.get('reference', r.get('ground_truth', ''))
        gts[sample_id] = [reference]
        
        # Get prediction
        prediction = r.get('prediction', r.get('response', ''))
        
        # Preprocess if needed (e.g., NatureLM timestamps)
        if needs_preprocessing:
            prediction = preprocess_naturelm_text(prediction)
        
        res[sample_id] = [prediction]

    # Compute metrics
    try:
        # CIDEr
        cider_scorer = Cider()
        cider_score, _ = cider_scorer.compute_score(gts, res)

        # SPICE
        try:
            spice_scorer = Spice()
            spice_score, _ = spice_scorer.compute_score(gts, res)
            print(f"    SPICE: {spice_score:.4f}")
        except Exception as e:
            print(f"    SPICE failed: {str(e)[:50]}")
            print("    WARNING: SPICE set to 0.0, so the SPIDEr value below equals CIDEr / 2")
            spice_score = 0.0

        # SPIDEr = (CIDEr + SPICE) / 2
        spider_score = (cider_score + spice_score) / 2

        # BLEU-4
        bleu_scorer = Bleu(4)
        bleu_scores, _ = bleu_scorer.compute_score(gts, res)
        bleu_4 = bleu_scores[3]

        # ROUGE-L
        rouge_scorer = Rouge()
        rouge_score, _ = rouge_scorer.compute_score(gts, res)

        return {
            'total': len(results),
            'successful': len(successful_results),
            'cider': float(cider_score),
            'spice': float(spice_score),
            'spider': float(spider_score),
            'bleu_4': float(bleu_4),
            'rouge_l': float(rouge_score)
        }
    except Exception as e:
        print(f"    Error: {e}")
        return None


print("="*60)
print("FUNCTIONS DEFINED SUCCESSFULLY")
print("="*60)
print("\nPreprocessing enabled for: NatureLM (removes timestamps)")

# Run preprocessing verification
verify_preprocessing(Path('results'))

In [None]:
# Step 6: Compute SPIDEr Scores for All Uploaded Files
from pathlib import Path

results_dir = Path('results')
all_results = []
models_found = set()

# Get all JSON files
json_files = sorted(results_dir.glob('*_results.json'))

print("="*80)
print("COMPUTING SPIDEr SCORES")
print("="*80)
print(f"\nFound {len(json_files)} result files\n")

for idx, filepath in enumerate(json_files):
    filename = filepath.name
    parsed = parse_filename(filename)
    model = parsed['model']
    prompt = parsed['prompt']
    shots = parsed['shots']
    
    config_name = f"{model}_{prompt}_{shots}shot"
    models_found.add(model)
    
    print(f"[{idx+1}/{len(json_files)}] {config_name}")
    
    # Load result data
    try:
        with open(filepath, 'r') as f:
            result_data = json.load(f)
    except Exception as e:
        print(f"    Failed to load: {e}")
        all_results.append({
            'config': config_name,
            'model': model,
            'prompt': prompt,
            'shots': shots,
            'status': 'load_error',
            'metrics': None
        })
        continue
    
    # Compute metrics
    metrics = compute_spider_scores(result_data, model_name=model)
    
    if metrics:
        print(f"    SPIDEr: {metrics['spider']:.4f} (CIDEr: {metrics['cider']:.4f})")
        all_results.append({
            'config': config_name,
            'model': model,
            'prompt': prompt,
            'shots': shots,
            'status': 'success',
            'metrics': metrics
        })
    else:
        print(f"    Failed to compute metrics")
        all_results.append({
            'config': config_name,
            'model': model,
            'prompt': prompt,
            'shots': shots,
            'status': 'error',
            'metrics': None
        })

successful = sum(1 for r in all_results if r['status'] == 'success')
print(f"\n" + "="*80)
print(f"COMPUTATION COMPLETE: {successful}/{len(json_files)} successful")
print(f"Models found: {', '.join(sorted(models_found))}")
print("="*80)

In [None]:
# Step 7: Generate Output Files
from datetime import datetime

timestamp = datetime.now().isoformat()

# Create output for each model
output_files = []

for model in sorted(models_found):
    model_results = [r for r in all_results if r['model'] == model]
    successful_count = sum(1 for r in model_results if r['status'] == 'success')
    
    output_data = {
        'timestamp': timestamp,
        'model': model,
        'total_configs': len(model_results),
        'successful_configs': successful_count,
        'spice_enabled': True,
        'preprocessing_applied': model.lower() in MODELS_WITH_TIMESTAMPS,
        'results': model_results
    }
    
    output_filename = f'spider_scores_{model}.json'
    with open(output_filename, 'w') as f:
        json.dump(output_data, f, indent=2)
    
    output_files.append(output_filename)
    print(f"Saved: {output_filename} ({successful_count} configs)")

# Create combined output
combined_output = {
    'timestamp': timestamp,
    'models': list(sorted(models_found)),
    'total_configs': len(all_results),
    'successful_configs': sum(1 for r in all_results if r['status'] == 'success'),
    'spice_enabled': True,
    'results': all_results
}

combined_filename = 'spider_scores_all_models.json'
with open(combined_filename, 'w') as f:
    json.dump(combined_output, f, indent=2)
output_files.append(combined_filename)

print(f"\nSaved combined file: {combined_filename}")
print(f"\n" + "="*60)
print(f"GENERATED {len(output_files)} OUTPUT FILES")
print("="*60)

In [None]:
# Step 8: Display Descriptive Statistics
import pandas as pd

print("="*80)
print("DESCRIPTIVE STATISTICS: SPIDEr SCORES BY MODEL")
print("="*80)

# Create DataFrame for easier analysis
successful_results = [r for r in all_results if r['status'] == 'success']

if not successful_results:
    print("\nNo successful results to analyze.")
else:
    # Build data for DataFrame
    data = []
    for r in successful_results:
        data.append({
            'Model': r['model'],
            'Prompt': r['prompt'],
            'Shots': r['shots'],
            'Config': r['config'],
            'SPIDEr': r['metrics']['spider'],
            'CIDEr': r['metrics']['cider'],
            'SPICE': r['metrics']['spice'],
            'BLEU-4': r['metrics']['bleu_4'],
            'ROUGE-L': r['metrics']['rouge_l'],
            'Samples': r['metrics']['successful']
        })
    
    df = pd.DataFrame(data)
    
    # ===== PER-MODEL SUMMARY =====
    print("\n" + "-"*80)
    print("OVERALL PERFORMANCE BY MODEL")
    print("-"*80)
    
    model_summary = df.groupby('Model').agg({
        'SPIDEr': ['mean', 'std', 'min', 'max'],
        'CIDEr': 'mean',
        'SPICE': 'mean',
        'Config': 'count'
    }).round(4)
    model_summary.columns = ['SPIDEr Mean', 'SPIDEr Std', 'SPIDEr Min', 'SPIDEr Max', 'CIDEr Mean', 'SPICE Mean', 'Configs']
    print(model_summary.to_string())
    
    # ===== BEST CONFIG PER MODEL =====
    print("\n" + "-"*80)
    print("BEST CONFIGURATION PER MODEL")
    print("-"*80)
    
    for model in df['Model'].unique():
        model_df = df[df['Model'] == model]
        best_idx = model_df['SPIDEr'].idxmax()
        best = model_df.loc[best_idx]
        print(f"\n{model.upper()}:")
        print(f"  Best: {best['Prompt']} {best['Shots']}-shot")
        print(f"  SPIDEr: {best['SPIDEr']:.4f} (CIDEr: {best['CIDEr']:.4f}, SPICE: {best['SPICE']:.4f})")
    
    # ===== PERFORMANCE BY PROMPT TYPE =====
    print("\n" + "-"*80)
    print("PERFORMANCE BY PROMPT TYPE (averaged across shots)")
    print("-"*80)
    
    # Leave missing Model/Prompt combinations as NaN so they are not mistaken for a score of 0
    prompt_summary = df.groupby(['Model', 'Prompt'])['SPIDEr'].mean().unstack().round(4)
    print(prompt_summary.to_string())
    
    # ===== PERFORMANCE BY SHOT COUNT =====
    print("\n" + "-"*80)
    print("PERFORMANCE BY SHOT COUNT (averaged across prompts)")
    print("-"*80)
    
    # Leave missing Model/Shots combinations as NaN so they are not mistaken for a score of 0
    shot_summary = df.groupby(['Model', 'Shots'])['SPIDEr'].mean().unstack().round(4)
    print(shot_summary.to_string())
    
    # ===== FULL RESULTS TABLE =====
    print("\n" + "-"*80)
    print("FULL RESULTS TABLE")
    print("-"*80)
    
    display_df = df[['Model', 'Prompt', 'Shots', 'SPIDEr', 'CIDEr', 'SPICE', 'BLEU-4', 'ROUGE-L']].copy()
    display_df = display_df.sort_values(['Model', 'Prompt', 'Shots'])
    print(display_df.to_string(index=False))

print("\n" + "="*80)

In [None]:
# Step 9: Visualize Results (Optional)
import matplotlib.pyplot as plt
import numpy as np

if successful_results:
    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: SPIDEr by Model and Prompt
    ax1 = axes[0]
    models = df['Model'].unique()
    prompts = df['Prompt'].unique()
    
    x = np.arange(len(prompts))
    width = 0.8 / len(models)
    
    for i, model in enumerate(models):
        model_data = df[df['Model'] == model]
        means = [model_data[model_data['Prompt'] == p]['SPIDEr'].mean() for p in prompts]
        ax1.bar(x + i*width - 0.4 + width/2, means, width, label=model)
    
    ax1.set_xlabel('Prompt Type')
    ax1.set_ylabel('SPIDEr Score')
    ax1.set_title('SPIDEr Score by Model and Prompt')
    ax1.set_xticks(x)
    ax1.set_xticklabels(prompts, rotation=45, ha='right')
    ax1.legend()
    ax1.grid(axis='y', alpha=0.3)
    
    # Plot 2: SPIDEr by Model and Shot Count
    ax2 = axes[1]
    shots = sorted(df['Shots'].unique())
    
    for model in models:
        model_data = df[df['Model'] == model]
        means = [model_data[model_data['Shots'] == s]['SPIDEr'].mean() for s in shots]
        ax2.plot(shots, means, marker='o', label=model, linewidth=2, markersize=8)
    
    ax2.set_xlabel('Shot Count')
    ax2.set_ylabel('SPIDEr Score')
    ax2.set_title('SPIDEr Score by Model and Shot Count')
    ax2.set_xticks(shots)
    ax2.legend()
    ax2.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('spider_scores_visualization.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("\nVisualization saved: spider_scores_visualization.png")

In [None]:
# Step 10: Download All Result Files
from google.colab import files

print("="*60)
print("DOWNLOADING RESULT FILES")
print("="*60)

# Download all output files
for filename in output_files:
    if os.path.exists(filename):
        print(f"Downloading: {filename}")
        files.download(filename)

# Download visualization if it exists
if os.path.exists('spider_scores_visualization.png'):
    print("Downloading: spider_scores_visualization.png")
    files.download('spider_scores_visualization.png')

print("\n" + "="*60)
print("ALL FILES DOWNLOADED SUCCESSFULLY")
print("="*60)
print("\nFiles contain SPIDEr scores (valid only if SPICE ran without errors in Step 6) with:")
print("  - CIDEr (consensus-based image description evaluation)")
print("  - SPICE (semantic propositional image caption evaluation)")
print("  - SPIDEr = (CIDEr + SPICE) / 2")

---

## Notes on Model-Specific Preprocessing

### NatureLM
NatureLM outputs include timestamp annotations for longer audio files:
```
American Woodcock calling with American Robin in background.
#10.00s - 20.00s#: American Woodcock
#20.00s - 30.00s#: American Woodcock calling...
```

This notebook **automatically extracts only the first caption** before computing SPIDEr scores, ensuring fair comparison with other models.
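The extraction step can be sketched as follows. This is a standalone restatement of the `preprocess_naturelm_text()` logic from Step 5, applied to the example above; the names `first_caption` and `TIMESTAMP_PREFIX` are hypothetical.

```python
import re

# Matches a leading annotation such as "#0.00s - 10.00s#: "
TIMESTAMP_PREFIX = re.compile(r'^#[\d.]+s\s*-\s*[\d.]+s#:\s*')


def first_caption(text: str) -> str:
    """Keep the first line of the output and drop any leading timestamp."""
    first_line = text.strip().split('\n')[0].strip()
    return TIMESTAMP_PREFIX.sub('', first_line).strip()


raw = (
    "American Woodcock calling with American Robin in background.\n"
    "#10.00s - 20.00s#: American Woodcock\n"
    "#20.00s - 30.00s#: American Woodcock calling..."
)

print(first_caption(raw))
print(first_caption("#0.00s - 10.00s#: American Woodcock"))
```

Note that everything after the first line is discarded, so any information that appears only in later timestamped segments does not reach the scorer.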

### Adding New Models
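To add preprocessing for a new model:
1. Add the model name (lowercase) to `MODELS_WITH_TIMESTAMPS`. The name must match the first underscore-delimited token of the result filename, and every model in this list is currently passed through `preprocess_naturelm_text()`.
2. If the model's output format differs from NatureLM's, create a preprocessing function similar to `preprocess_naturelm_text()`
3. Update `compute_spider_scores()` to call the new preprocessor for that model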
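One possible way to organize the last step is a lookup table from model name to preprocessing function, so that the scoring function needs no per-model branches. This is a sketch rather than the notebook's current design: `PREPROCESSORS`, `preprocess`, `strip_timestamps`, `strip_prefix_tag`, and the model name `mymodel` are all hypothetical.

```python
import re
from typing import Callable, Dict


def strip_timestamps(text: str) -> str:
    """NatureLM-style cleanup: keep the first line, drop a leading timestamp."""
    first_line = text.strip().split('\n')[0].strip()
    return re.sub(r'^#[\d.]+s\s*-\s*[\d.]+s#:\s*', '', first_line).strip()


def strip_prefix_tag(text: str) -> str:
    """Cleanup for an imaginary model that prefixes its output with 'Caption:'."""
    return re.sub(r'^\s*caption:\s*', '', text, flags=re.IGNORECASE).strip()


# Hypothetical registry: lowercase model name -> preprocessing function
PREPROCESSORS: Dict[str, Callable[[str], str]] = {
    'naturelm': strip_timestamps,
    'mymodel': strip_prefix_tag,
}


def preprocess(model_name: str, text: str) -> str:
    """Apply the registered cleaner for this model, or return text unchanged."""
    cleaner = PREPROCESSORS.get((model_name or '').lower())
    if cleaner is None or not isinstance(text, str):
        return text
    return cleaner(text)


print(preprocess('naturelm', "Robin singing.\n#10.00s - 20.00s#: Robin"))
print(preprocess('MyModel', "Caption: a robin sings"))
print(preprocess('qwen', "A robin sings."))
```

With a registry like this, supporting a new model would mean adding one function and one dictionary entry.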

### Output File Format
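Each model gets its own JSON file, `spider_scores_{model}.json`, and a combined `spider_scores_all_models.json` is written alongside. An abridged example of a per-model file: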
```json
{
  "model": "qwen",
  "total_configs": 12,
  "results": [
    {
      "config": "qwen_baseline_3shot",
      "metrics": {
        "spider": 0.0723,
        "cider": 0.0456,
        "spice": 0.0990
      }
    }
  ]
}
```
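
A downloaded file can be inspected with the standard library alone. The sketch below assumes the structure shown above; the helper name `best_config` is hypothetical, and the inline data are placeholders rather than real scores.

```python
import json


def best_config(output_data: dict):
    """Return the result entry with the highest SPIDEr, or None if none has metrics."""
    scored = [r for r in output_data.get('results', []) if r.get('metrics')]
    if not scored:
        return None
    return max(scored, key=lambda r: r['metrics']['spider'])


# Placeholder data in the same shape as a per-model output file.
example = json.loads("""
{
  "model": "qwen",
  "total_configs": 3,
  "results": [
    {"config": "qwen_baseline_0shot",
     "metrics": {"spider": 0.0500, "cider": 0.0300, "spice": 0.0700}},
    {"config": "qwen_baseline_3shot",
     "metrics": {"spider": 0.0723, "cider": 0.0456, "spice": 0.0990}},
    {"config": "qwen_baseline_5shot", "metrics": null}
  ]
}
""")

best = best_config(example)
print(best['config'], best['metrics']['spider'])
```

Entries whose `metrics` field is `null` (configurations that failed to load or score) are skipped instead of being ranked as zero.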