# YOLO Model Benchmarking Notebook
This notebook demonstrates how to benchmark the YOLO models on new datasets. 

It can be used to test a model on new data, or to output new visualizations of a model on its validation data.

## Features
- **Flexible data selection**: Choose between validation-only (`val`) or all data (`all`) modes
- **Comprehensive metrics**: Precision, recall, F1-score, and per-class performance
- **Multiple model support**: Test different YOLO models on the same dataset
- **Visualization**: Charts and graphs for easy result interpretation
- **Results export**: Save results to JSON files for further analysis

## Setup and Imports

In [3]:
import sys
import os
from pathlib import Path

# Add the beeyolo directory to the path so we can import our modules
notebook_dir = Path.cwd()
project_root = notebook_dir.parent
beeyolo_path = project_root / "beeyolo"
sys.path.insert(0, str(beeyolo_path))

# Now import our benchmark module
from benchmark_YOLO import YOLOBenchmark

# Standard data science imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print(f"Project root: {project_root}")
print(f"BeeYOLO path: {beeyolo_path}")
print(f"Current working directory: {notebook_dir}")

# Check if directories exist
models_dir = project_root / "models"
datasets_dir = project_root / "datasets"

print(f"Models directory exists: {models_dir.exists()}")
print(f"Datasets directory exists: {datasets_dir.exists()}")

Project root: /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training
BeeYOLO path: /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/beeyolo
Current working directory: /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/notebooks
Models directory exists: True
Datasets directory exists: True


## Initialize the Benchmark

In [4]:
# Initialize the benchmark with correct paths
print("Initializing YOLO Benchmark...")
benchmark = YOLOBenchmark(
    models_dir=str(project_root / "models"),
    datasets_dir=str(project_root / "datasets")
)
print("✓ Benchmark initialized successfully!")

Initializing YOLO Benchmark...
✓ Benchmark initialized successfully!


## Explore available models and datasets

In [5]:
# Get available models and datasets
models = benchmark.get_available_models()
datasets = benchmark.get_available_datasets()

print("📁 Available Models:")
for i, model in enumerate(models, 1):
    print(f"  {i}. {model}")

print("\nAvailable Datasets:")
for i, dataset in enumerate(datasets, 1):
    print(f"  {i}. {dataset}")
    
    # Show dataset info
    try:
        config = benchmark.load_dataset_info(dataset)
        print(f"     Classes: {config['nc']} - {', '.join(config['names'])}")
    except Exception as e:
        print(f"     Error loading config: {e}")

📁 Available Models:
  1. mortality-classYOLO
  2. beeYOLO

Available Datasets:
  1. bee-feeder-data
     Classes: 2 - bee, feeder
  2. alive-dead-data
     Classes: 3 - bee_alive, bee_dead, feeder
  3. hoverfly-data
     Classes: 2 - bee, feeder


## Dataset information and statistics

In [6]:
# Analyze dataset statistics
dataset_stats = {}

for dataset_name in datasets:
    try:
        # Get validation data stats
        val_images, val_labels = benchmark.get_dataset_images_and_labels(dataset_name, "val")
        
        # Get all data stats
        all_images, all_labels = benchmark.get_dataset_images_and_labels(dataset_name, "all")
        
        dataset_stats[dataset_name] = {
            'val_images': len(val_images),
            'val_labels': len(val_labels),
            'all_images': len(all_images),
            'all_labels': len(all_labels)
        }
    except Exception as e:
        print(f"Error processing {dataset_name}: {e}")

# Display as a table
if dataset_stats:
    df_stats = pd.DataFrame(dataset_stats).T
    df_stats.columns = ['Val Images', 'Val Labels', 'All Images', 'All Labels']
    display(df_stats)
else:
    print("No dataset statistics available")

| Dataset | Val Images | Val Labels | All Images | All Labels |
|---|---|---|---|---|
| bee-feeder-data | 30 | 30 | 150 | 150 |
| alive-dead-data | 50 | 49 | 205 | 204 |
| hoverfly-data | 38 | 38 | 113 | 113 |
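
The statistics above show one more image than label file for `alive-dead-data` in both modes. That may be intentional (an image with no objects) or a missing annotation. The following is a minimal sketch for listing such images, assuming each image is paired with a `.txt` label file of the same stem. The helper name `find_unlabeled_images` and the demo file names are hypothetical and not part of `benchmark_YOLO`.

```python
import tempfile
from pathlib import Path

# Image extensions to consider (an assumption; adjust to the dataset)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def find_unlabeled_images(images_dir, labels_dir):
    """Return sorted names of images with no same-stem .txt file in labels_dir."""
    images_dir, labels_dir = Path(images_dir), Path(labels_dir)
    label_stems = {p.stem for p in labels_dir.glob("*.txt")}
    return sorted(
        p.name
        for p in images_dir.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES and p.stem not in label_stems
    )


# Demo on a throwaway directory tree with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    images, labels = Path(tmp, "images"), Path(tmp, "labels")
    images.mkdir()
    labels.mkdir()
    for name in ("a.jpg", "b.jpg"):
        (images / name).touch()
    (labels / "a.txt").touch()  # only a.jpg has a label file
    missing = find_unlabeled_images(images, labels)

print(missing)  # ['b.jpg']
```

To use it on a real dataset, point the two arguments at the image and label folders for the split of interest. The exact folder layout depends on the dataset configuration.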


## Example 1: Benchmark on Validation Data Only
This demonstrates the standard evaluation mode using only validation data.

In [8]:
# Choose a model and dataset for this example
if models and datasets:
    model_name = models[1]  # Second available model (beeYOLO)
    dataset_name = datasets[2]  # Hoverfly dataset
    
    print(f"Running benchmark on validation data only...")
    print(f"Model: {model_name}")
    print(f"Dataset: {dataset_name}")
    print(f"Mode: val (validation only)")
    
    # Run benchmark
    results_val = benchmark.run_benchmark(
        model_name=model_name,
        dataset_name=dataset_name,
        mode="val",
        conf_threshold=0.5,
        iou_threshold=0.5
    )
    
    print("Benchmark completed!")
    
else:
    print("No models or datasets found!")

Running benchmark on validation data only...
Model: beeYOLO
Dataset: hoverfly-data
Mode: val (validation only)
Evaluating model /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/models/beeYOLO.pt on dataset hoverfly-data (val mode)
Found 38 images to evaluate


Debug: First image dimensions: 1920x1080
Debug: First GT label format: [0, 0.258854, 0.783333, 0.085417, 0.148148]
Debug: First detection format: [0, np.float32(490.0893), np.float32(847.4977), np.float32(146.88245), np.float32(160.90063), array(    0.88237, dtype=float32)]
Debug: Comparing detection [0, np.float32(490.0893), np.float32(847.4977), np.float32(146.88245), np.float32(160.90063), array(    0.88237, dtype=float32)] with GT [0, 496.99967999999996, 845.9996399999999, 164.00064, 159.99984]

Processing images: 100%|██████████| 38/38 [00:03<00:00,  9.60it/s]

Benchmark completed!
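
The debug output shows a ground-truth label in normalised YOLO form and the same label scaled to the 1920x1080 image. The sketch below reproduces that scaling and computes the IoU between the logged detection and the ground-truth box. It assumes both boxes are `[class, cx, cy, w, h]` with a centre-based origin. That holds for YOLO label files, but it is only an assumption for the detection, whose convention the log does not state. The function names are illustrative and not taken from `benchmark_YOLO`.

```python
def yolo_to_pixels(label, img_w, img_h):
    """Scale a normalised [class, cx, cy, w, h] label to pixel units."""
    cls, cx, cy, w, h = label
    return [cls, cx * img_w, cy * img_h, w * img_w, h * img_h]


def iou_cxcywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h) in the same units."""
    # Convert centre/size to corner coordinates
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    # Overlap is clamped at zero when the boxes do not intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


# Values copied from the debug lines above
gt = yolo_to_pixels([0, 0.258854, 0.783333, 0.085417, 0.148148], 1920, 1080)
det = [0, 490.0893, 847.4977, 146.88245, 160.90063]  # assumed centre-based

iou = iou_cxcywh(det[1:5], gt[1:5])
print(f"GT in pixels: {gt}")
print(f"IoU = {iou:.3f}")
```

Under the centre-based assumption, this pair overlaps well above the 0.5 IoU threshold used in the run. A check like this can help confirm which box convention each side uses when matched counts look unexpectedly low.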





## Display Validation Results

In [9]:
if 'results_val' in locals():
    # Print detailed results
    benchmark.print_results(results_val)
    
    # Create a summary DataFrame for easier analysis
    summary_data = {
        'Metric': ['Precision', 'Recall', 'F1-Score'],
        'Value': [
            results_val['overall_precision'],
            results_val['overall_recall'],
            results_val['overall_f1_score']
        ]
    }
    
    df_summary = pd.DataFrame(summary_data)
    display(df_summary)
else:
    print("No validation results available")


BENCHMARK RESULTS
Model: /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/models/beeYOLO.pt
Dataset: hoverfly-data
Mode: val
Total Images: 38
Total Ground Truth: 76
Total Detections: 58
Total Correct: 0
Confidence Threshold: 0.5
IoU Threshold: 0.5

Overall Metrics:
  Precision: 0.0000
  Recall: 0.0000
  F1-Score: 0.0000

Per-Class Metrics:
  bee:
    Precision: 0.0000
    Recall: 0.0000
    F1-Score: 0.0000
    TP: 0, FP: 36, FN: 38
  feeder:
    Precision: 0.0000
    Recall: 0.0000
    F1-Score: 0.0000
    TP: 0, FP: 22, FN: 38


| | Metric | Value |
|---|---|---|
| 0 | Precision | 0.0 |
| 1 | Recall | 0.0 |
| 2 | F1-Score | 0.0 |
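
For reference, the precision, recall, and F1 figures above follow from the per-class TP/FP/FN counts. The sketch below uses the standard definitions with a guard against division by zero. Summing the counts across classes to get the overall figures is an assumption about how `benchmark_YOLO` aggregates, although the sums are consistent with the reported totals (58 detections, 76 ground-truth boxes).

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) per class, copied from the printed results above
per_class = {"bee": (0, 36, 38), "feeder": (0, 22, 38)}

for name, counts in per_class.items():
    print(name, precision_recall_f1(*counts))

# Assumed aggregation: sum the counts over classes, then compute the metrics
totals = tuple(sum(col) for col in zip(*per_class.values()))
overall = precision_recall_f1(*totals)
print("overall", totals, overall)
```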


## Example 2: Benchmark on All Data (Train + Validation)

This demonstrates using the combined dataset mode, useful when the model wasn't trained on any of the data.

In [7]:
if 'model_name' in locals() and 'dataset_name' in locals():
    print(f"🔍 Running benchmark on all data (train + validation)...")
    print(f"Model: {model_name}")
    print(f"Dataset: {dataset_name}")
    print(f"Mode: all (train + validation)")
    
    # Run benchmark on all data
    results_all = benchmark.run_benchmark(
        model_name=model_name,
        dataset_name=dataset_name,
        mode="all",
        conf_threshold=0.5,
        iou_threshold=0.5
    )
    
    print("Benchmark completed!")
    
else:
    print("Model or dataset not defined")

Model or dataset not defined


## Display All-Data Results

In [1]:
if 'results_all' in locals():
    # Print detailed results
    benchmark.print_results(results_all)
    
    # Create a summary DataFrame for easier analysis
    summary_data = {
        'Metric': ['Precision', 'Recall', 'F1-Score'],
        'Value': [
            results_all['overall_precision'],
            results_all['overall_recall'],
            results_all['overall_f1_score']
        ]
    }
    
    df_summary = pd.DataFrame(summary_data)
    display(df_summary)
else:
    print("No all-data results available")

No all-data results available


## Compare Validation vs All Data Results

In [None]:
if 'results_val' in locals() and 'results_all' in locals():
    # Create comparison DataFrame
    comparison_data = {
        'Mode': ['Validation Only', 'All Data'],
        'Images': [results_val['total_images'], results_all['total_images']],
        'Ground Truth': [results_val['total_ground_truth'], results_all['total_ground_truth']],
        'Detections': [results_val['total_detections'], results_all['total_detections']],
        'Correct': [results_val['total_correct'], results_all['total_correct']],
        'Precision': [results_val['overall_precision'], results_all['overall_precision']],
        'Recall': [results_val['overall_recall'], results_all['overall_recall']],
        'F1-Score': [results_val['overall_f1_score'], results_all['overall_f1_score']]
    }
    
    df_comparison = pd.DataFrame(comparison_data)
    display(df_comparison)
    
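    # Visualize the comparison: raw counts on the left, metrics on the right
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    fig.suptitle(f'Model Performance Comparison: {model_name} on {dataset_name}', fontsize=16)
    
    # One group of bars per quantity, one bar per mode
    df_plot = df_comparison.set_index('Mode')
    df_plot[['Images', 'Ground Truth', 'Detections', 'Correct']].T.plot(kind='bar', ax=axes[0], rot=0)
    axes[0].set_title('Counts')
    df_plot[['Precision', 'Recall', 'F1-Score']].T.plot(kind='bar', ax=axes[1], rot=0)
    axes[1].set_title('Metrics')
    axes[1].set_ylim(0, 1)
    
    plt.tight_layout()
    plt.show()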
    
else:
    print("❌ Both validation and all-data results are needed for comparison")

## Generate Summary Report

In [None]:
# Generate summary report
if hasattr(benchmark, 'results') and benchmark.results:
    print("Generating Summary Report...")
    benchmark.generate_summary_report()
    
    # Also show as a DataFrame
    summary_data = []
    for key, result in benchmark.results.items():
        summary_data.append({
            'Model': Path(result['model_path']).stem,
            'Dataset': result['dataset_name'],
            'Mode': result['mode'],
            'Images': result['total_images'],
            'Precision': f"{result['overall_precision']:.4f}",
            'Recall': f"{result['overall_recall']:.4f}",
            'F1-Score': f"{result['overall_f1_score']:.4f}"
        })
    
    df_summary = pd.DataFrame(summary_data)
    display(df_summary)
    
else:
    print("❌ No benchmark results available for summary")