# YOLO Benchmarking Walkthrough

This guide walks through the complete process of running YOLO benchmarks on various datasets, from initial setup to result analysis and comparison.

## Table of Contents
1. [Initial Setup and Import](#initial-setup-and-import)
2. [Exploring Available Models and Datasets](#exploring-available-models-and-datasets)
3. [Selecting Models and Datasets](#selecting-models-and-datasets)
4. [Combining and Analyzing Results](#combining-and-analyzing-results)
5. [Comparing Model Performance](#comparing-model-performance)
6. [Visualization: Performance Barplots](#visualization-performance-barplots)

## Initial Setup and Import

First, make sure the required dependencies are installed and that you are running from a directory where the relative paths used below (`../models`, `../datasets`, `../benchmarks`) resolve:

In [2]:
# Activate virtual environment if using one
# source venv/bin/activate  # On Unix/Mac
# venv\Scripts\activate     # On Windows

# Import the benchmark class
from beeyolo.benchmark_YOLO import YOLOBenchmark

# Initialize the benchmark
benchmark = YOLOBenchmark(
    models_dir="../models",
    datasets_dir="../datasets", 
    benchmarks_dir="../benchmarks"
)

print("✓ Successfully imported YOLOBenchmark")
print("\nInitializing benchmark...")

✓ Successfully imported YOLOBenchmark

Initializing benchmark...
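
Because the benchmark is initialized with relative paths, a quick check that they resolve from the current working directory can save a confusing failure later. The sketch below uses only `pathlib`; the helper name `find_missing_dirs` is hypothetical and is not part of `beeyolo`.

```python
from pathlib import Path


def find_missing_dirs(dirs, base="."):
    """Return the entries of `dirs` that are not existing directories under `base`."""
    base_path = Path(base)
    return [d for d in dirs if not (base_path / d).is_dir()]


# Same relative paths as passed to YOLOBenchmark above
missing = find_missing_dirs(["../models", "../datasets", "../benchmarks"])
if missing:
    print(f"Missing directories (check your working directory): {missing}")
else:
    print("All benchmark directories found")
```

If anything is reported as missing, changing into the expected working directory (or passing absolute paths) is usually the simplest fix.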


## Exploring Available Models and Datasets

Before running benchmarks, let's see what's available:

In [4]:
# Get available models
models = benchmark.get_available_models()
print("\nAvailable models:")
for i, model in enumerate(models, 1):
    print(f"  {i}. {model}")

# Get available datasets
datasets = benchmark.get_available_datasets()
print("\nAvailable datasets:")
for i, dataset in enumerate(datasets, 1):
    print(f"  {i}. {dataset}")

# Get detailed information about models and datasets
print("\nDetailed Information:")
print("Models:")
for model_name in models:
    model_path = benchmark.models_dir / f"{model_name}.pt"
    if model_path.exists():
        file_size = model_path.stat().st_size / (1024 * 1024)  # MB
        print(f"  {model_name}: {file_size:.1f} MB")

print("Datasets:")
for dataset_name in datasets:
    try:
        # Get basic dataset info without verbose logging
        dataset_path = benchmark.datasets_dir / dataset_name
        if (dataset_path / "data.yaml").exists():
            # Count validation images directly
            val_images_dir = dataset_path / "val" / "images"
            val_labels_dir = dataset_path / "val" / "labels"
            
            num_val_images = len(list(val_images_dir.glob("*.png"))) if val_images_dir.exists() else 0
            num_val_labels = len(list(val_labels_dir.glob("*.txt"))) if val_labels_dir.exists() else 0
            
            # Get class count from data.yaml
            import yaml
            with open(dataset_path / "data.yaml", 'r') as f:
                config = yaml.safe_load(f)
                num_classes = config.get('nc', 0)
            
            print(f"  {dataset_name}: {num_classes} classes, {num_val_images} validation images")
        else:
            print(f"  {dataset_name}: No data.yaml found")
    except Exception as e:
        print(f"  {dataset_name}: Error getting info - {e}")


Available models:
  1. mortalityYOLO
  2. beeYOLO
  3. insectYOLO

Available datasets:
  1. ivybee-data
  2. bee-feeder-data
  3. insect-data
  4. alive-dead-data
  5. bumblebee-data
  6. hoverfly-data
  7. honeybee-data

Detailed Information:
Models:
  mortalityYOLO: 6.0 MB
  beeYOLO: 5.9 MB
  insectYOLO: 17.6 MB
Datasets:
  ivybee-data: 2 classes, 38 validation images
  bee-feeder-data: 2 classes, 30 validation images
  insect-data: 2 classes, 166 validation images
  alive-dead-data: 3 classes, 50 validation images
  bumblebee-data: 2 classes, 50 validation images
  hoverfly-data: 2 classes, 38 validation images
  honeybee-data: 2 classes, 40 validation images
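
The cell above counts only `*.png` files, which matches these datasets. For datasets that mix image formats, a slightly more general counter could look like the sketch below. The extension set and the helper name `count_images` are assumptions for illustration, not part of the benchmark code.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your datasets
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def count_images(directory, extensions=IMAGE_EXTENSIONS):
    """Count files directly inside `directory` whose suffix is an image extension."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(
        1
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```

In the loop above it would be called as `count_images(dataset_path / "val" / "images")`.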


## Selecting Models and Datasets

Choose which models and datasets to benchmark, then run each selected model and collect the results:

In [9]:
# Select models and datasets
selected_models = ["insectYOLO", "beeYOLO"]
selected_datasets = ["ivybee-data", "bumblebee-data", "honeybee-data", "hoverfly-data"]

# Run benchmarks for all combinations
all_results = {}
for model in selected_models:
    print(f"\n{'='*60}")
    print(f"TESTING MODEL: {model}")
    print(f"{'='*60}")
    
    # Set the current model
    benchmark.select_model(model)
    
    # Run benchmarks for this model. Note that selected_datasets is not passed
    # here, so run_all_benchmarks itself determines which datasets are evaluated
    model_results = benchmark.run_all_benchmarks(
        conf_threshold=0.5,
        iou_threshold=0.5
    )
    
    # Store results with model prefix
    for dataset, result in model_results.items():
        key = f"{model}_{dataset}_val"
        all_results[key] = result

print(f"\n✓ All model-dataset combinations completed!")
print(f"Total combinations tested: {len(all_results)}")


TESTING MODEL: insectYOLO
✓ Model selected: insectYOLO

Running benchmark on bumblebee-data
Data mode: val
Evaluating model ../models/insectYOLO.pt on dataset bumblebee-data (val mode)
DEBUG: evaluate_model_on_dataset called with dataset_name='bumblebee-data'
✓ Model loaded: ../models/insectYOLO.pt
Model class names: {0: 'bee', 1: 'feeder'}
DEBUG: About to call load_dataset_info with dataset_name='bumblebee-data'
    load_dataset_info called with dataset_name: 'bumblebee-data'
    Looking for yaml at: ../datasets/bumblebee-data/data.yaml
Dataset: bumblebee-data
Classes: ['bee', 'feeder']
Mode: val
  Creating temp data.yaml for dataset: 'bumblebee-data', mode: 'val'
DEBUG: About to call _create_temp_data_yaml with dataset_name='bumblebee-data'
    _create_temp_data_yaml called with dataset_name: 'bumblebee-data', mode: 'val'
    DEBUG: _create_temp_data_yaml received dataset_name='bumblebee-data'
  Creating temporary data.yaml at: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmpzgo

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/bumblebee-data/val/labels.cache... 49 images, 1 backgrounds, 0 corrupt: 100%|██████████| 50/50 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 4/4 [00:15<00:00,  3.82s/it]


                   all         50        200       0.98       0.98      0.982      0.848
Speed: 2.0ms preprocess, 270.8ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to ../benchmarks/temp_validation/bumblebee-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(0.9795827688085168), 'metrics/recall(B)': np.float64(0.98), 'metrics/mAP50(B)': np.float64(0.9817751340271206), 'metrics/mAP50-95(B)': np.float64(0.8475792438635674), 'fitness': np.float64(0.8609988328799227)}
    load_dataset_info called with dataset_name: 'bumblebee-data'
    Looking for yaml at: ../datasets/bumblebee-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for bumblebee-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmpzgoi8j8x/temp_data.yaml
  Cl

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/honeybee-data/val/labels.cache... 40 images, 0 backgrounds, 0 corrupt: 100%|██████████| 40/40 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 3/3 [00:10<00:00,  3.62s/it]


                   all         40         80          1          1      0.995      0.842
Speed: 1.5ms preprocess, 211.3ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to ../benchmarks/temp_validation/honeybee-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(1.0), 'metrics/recall(B)': np.float64(1.0), 'metrics/mAP50(B)': np.float64(0.995), 'metrics/mAP50-95(B)': np.float64(0.8416463194380779), 'fitness': np.float64(0.8569816874942702)}
    load_dataset_info called with dataset_name: 'honeybee-data'
    Looking for yaml at: ../datasets/honeybee-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for honeybee-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmpt9rjp7rg/temp_data.yaml
  Cleaning up temporary directory: /v

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/ivybee-data/val/labels.cache... 38 images, 0 backgrounds, 0 corrupt: 100%|██████████| 38/38 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 3/3 [00:09<00:00,  3.19s/it]


                   all         38         76      0.987      0.987      0.989      0.845
Speed: 1.3ms preprocess, 182.3ms inference, 0.0ms loss, 0.3ms postprocess per image
Results saved to ../benchmarks/temp_validation/ivybee-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(0.986842105263158), 'metrics/recall(B)': np.float64(0.986842105263158), 'metrics/mAP50(B)': np.float64(0.9894973684210526), 'metrics/mAP50-95(B)': np.float64(0.8452112381591702), 'fitness': np.float64(0.8596398511853585)}
    load_dataset_info called with dataset_name: 'ivybee-data'
    Looking for yaml at: ../datasets/ivybee-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for ivybee-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmpi3sxb_el/temp_data.yaml
  Cl

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/hoverfly-data/val/labels.cache... 38 images, 0 backgrounds, 0 corrupt: 100%|██████████| 38/38 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 3/3 [00:10<00:00,  3.38s/it]


                   all         38         76          1      0.987      0.991       0.64
Speed: 2.0ms preprocess, 221.4ms inference, 0.0ms loss, 0.3ms postprocess per image
Results saved to ../benchmarks/temp_validation/hoverfly-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(1.0), 'metrics/recall(B)': np.float64(0.986842105263158), 'metrics/mAP50(B)': np.float64(0.9907), 'metrics/mAP50-95(B)': np.float64(0.6403265979651396), 'fitness': np.float64(0.6753639381686256)}
    load_dataset_info called with dataset_name: 'hoverfly-data'
    Looking for yaml at: ../datasets/hoverfly-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for hoverfly-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmppsyzojdr/temp_data.yaml
  Cleaning up temporar

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/bumblebee-data/val/labels.cache... 49 images, 1 backgrounds, 0 corrupt: 100%|██████████| 50/50 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 4/4 [00:11<00:00,  2.97s/it]


                   all         50        200       0.98       0.97       0.98      0.769
Speed: 1.0ms preprocess, 207.1ms inference, 0.0ms loss, 0.7ms postprocess per image
Results saved to ../benchmarks/temp_validation/bumblebee-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(0.9797959183673469), 'metrics/recall(B)': np.float64(0.97), 'metrics/mAP50(B)': np.float64(0.9796982928693582), 'metrics/mAP50-95(B)': np.float64(0.7687158649910112), 'fitness': np.float64(0.7898141077788459)}
    load_dataset_info called with dataset_name: 'bumblebee-data'
    Looking for yaml at: ../datasets/bumblebee-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for bumblebee-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmp80w6lmyw/temp_data.yaml
  Cl

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/honeybee-data/val/labels.cache... 40 images, 0 backgrounds, 0 corrupt: 100%|██████████| 40/40 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 3/3 [00:09<00:00,  3.22s/it]


                   all         40         80      0.762      0.666      0.741       0.39
Speed: 1.9ms preprocess, 196.3ms inference, 0.0ms loss, 0.6ms postprocess per image
Results saved to ../benchmarks/temp_validation/honeybee-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(0.7616612114796347), 'metrics/recall(B)': np.float64(0.6662291748766561), 'metrics/mAP50(B)': np.float64(0.7407731232844016), 'metrics/mAP50-95(B)': np.float64(0.38974540092158444), 'fitness': np.float64(0.42484817315786616)}
    load_dataset_info called with dataset_name: 'honeybee-data'
    Looking for yaml at: ../datasets/honeybee-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for honeybee-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmpr3umo1pc/temp_da

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/ivybee-data/val/labels.cache... 38 images, 0 backgrounds, 0 corrupt: 100%|██████████| 38/38 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 3/3 [00:09<00:00,  3.16s/it]


                   all         38         76        0.5      0.447      0.474      0.368
Speed: 2.8ms preprocess, 212.2ms inference, 0.0ms loss, 0.3ms postprocess per image
Results saved to ../benchmarks/temp_validation/ivybee-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(0.5), 'metrics/recall(B)': np.float64(0.4473684210526316), 'metrics/mAP50(B)': np.float64(0.4736250000000001), 'metrics/mAP50-95(B)': np.float64(0.3677767737104554), 'fitness': np.float64(0.3783615963394099)}
    load_dataset_info called with dataset_name: 'ivybee-data'
    Looking for yaml at: ../datasets/ivybee-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for ivybee-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmpp723ux0d/temp_data.yaml
  Cleaning up tem

val: Scanning /Users/user/Documents/BEEhaviourLab/BEEhaviourLab-YOLO-training/datasets/hoverfly-data/val/labels.cache... 38 images, 0 backgrounds, 0 corrupt: 100%|██████████| 38/38 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 3/3 [00:09<00:00,  3.06s/it]


                   all         38         76      0.833      0.671      0.747      0.357
Speed: 1.0ms preprocess, 187.1ms inference, 0.0ms loss, 0.4ms postprocess per image
Results saved to ../benchmarks/temp_validation/hoverfly-data_val
✓ Validation completed successfully
Available Ultralytics metrics: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)', 'fitness']
Sample metrics: {'metrics/precision(B)': np.float64(0.8333333333333333), 'metrics/recall(B)': np.float64(0.6710526315789473), 'metrics/mAP50(B)': np.float64(0.7466833333333334), 'metrics/mAP50-95(B)': np.float64(0.35745484648918474), 'fitness': np.float64(0.3963776951735996)}
    load_dataset_info called with dataset_name: 'hoverfly-data'
    Looking for yaml at: ../datasets/hoverfly-data/data.yaml
  ✓ Benchmark evaluation completed successfully
  Returning results for hoverfly-data
  Cleaning up temporary file: /var/folders/tq/xlt4lphs61q0t09n_9js3mbh0000gn/T/tmpia1jdqr7/temp_dat
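
The `fitness` value in each `Sample metrics` line is consistent with a weighted sum of the two mAP metrics, 0.1 × mAP50 + 0.9 × mAP50-95. The sketch below reproduces a logged value under that assumption. The weights are inferred from the numbers in the log, and `fitness_from_map` is a hypothetical helper, not an Ultralytics function.

```python
import math


def fitness_from_map(map50, map50_95, w50=0.1, w50_95=0.9):
    """Weighted sum of mAP50 and mAP50-95 (weights assumed, inferred from the log)."""
    return w50 * map50 + w50_95 * map50_95


# Values copied from the insectYOLO / bumblebee-data "Sample metrics" line above
computed = fitness_from_map(0.9817751340271206, 0.8475792438635674)
logged = 0.8609988328799227
print(math.isclose(computed, logged, rel_tol=1e-9))
```

Because mAP50-95 carries most of the weight, a model can score well on mAP50 and still show a low fitness, as in the hoverfly-data runs.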

## Combining and Analyzing Results

Next, load every `benchmark_results_*.json` file in the benchmarks folder, combine them into one DataFrame, and save the combined results as JSON and CSV:

In [10]:
import json
import pandas as pd
from pathlib import Path
from datetime import datetime

def load_all_benchmark_results(benchmarks_dir):
    """Load all benchmark results from JSON files in the benchmarks directory."""
    results = []
    benchmarks_path = Path(benchmarks_dir)
    
    # Find all JSON files
    json_files = list(benchmarks_path.glob("benchmark_results_*.json"))
    
    for json_file in json_files:
        try:
            with open(json_file, 'r') as f:
                data = json.load(f)
                # Add filename and timestamp for reference
                data['filename'] = json_file.name
                data['file_timestamp'] = json_file.stem.split('_')[-1]
                results.append(data)
        except Exception as e:
            print(f"Error loading {json_file}: {e}")
    
    return results

def create_results_dataframe(results):
    """Convert results to a pandas DataFrame for easy analysis."""
    df_data = []
    
    for result in results:
        df_data.append({
            'Model': Path(result['model_path']).stem,
            'Dataset': result['dataset_name'],
            'Mode': result['mode'],
            'Total Images': result['total_images'],
            'Total Ground Truth': result['total_ground_truth'],
            'Precision': result['overall_precision'],
            'Recall': result['overall_recall'],
            'F1-Score': result['overall_f1_score'],
            'mAP50': result['overall_map50'],
            'mAP50-95': result['overall_map50_95'],
            'Confidence Threshold': result['conf_threshold'],
            'IoU Threshold': result['iou_threshold'],
            'Filename': result['filename'],
            'Timestamp': result['file_timestamp']
        })
    
    return pd.DataFrame(df_data)

# Load and combine all results
print("Loading all benchmark results...")
all_results = load_all_benchmark_results("../benchmarks")
print(f"✓ Loaded {len(all_results)} benchmark result files")

# Create DataFrame
results_df = create_results_dataframe(all_results)
print(f"\nResults DataFrame shape: {results_df.shape}")
print("\nFirst few rows:")
print(results_df.head())

# Save combined results
combined_file = "../benchmarks/combined_benchmark_results.json"
with open(combined_file, 'w') as f:
    json.dump(all_results, f, indent=2)
print(f"\n✓ Combined results saved to: {combined_file}")

# Save as CSV for easy analysis
csv_file = "../benchmarks/combined_benchmark_results.csv"
results_df.to_csv(csv_file, index=False)
print(f"✓ Results DataFrame saved to: {csv_file}")

Loading all benchmark results...
✓ Loaded 8 benchmark result files

Results DataFrame shape: (8, 14)

First few rows:
        Model         Dataset Mode  Total Images  Total Ground Truth  \
0     beeYOLO  bumblebee-data  val            50                 200   
1  insectYOLO   honeybee-data  val            40                  80   
2     beeYOLO     ivybee-data  val            38                  76   
3  insectYOLO   hoverfly-data  val            38                  76   
4     beeYOLO   hoverfly-data  val            38                  76   

   Precision    Recall  F1-Score     mAP50  mAP50-95  Confidence Threshold  \
0   0.979796  0.970000  0.979698  0.979698  0.768716                   0.5   
1   1.000000  1.000000  0.995000  0.995000  0.841646                   0.5   
2   0.500000  0.447368  0.473625  0.473625  0.367777                   0.5   
3   1.000000  0.986842  0.990700  0.990700  0.640327                   0.5   
4   0.833333  0.671053  0.746683  0.746683  0.357455       
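
In the rows shown above, the `F1-Score` column is identical to `mAP50`, so it is worth cross-checking it against the usual definition, F1 = 2PR / (P + R). A minimal sketch, where `f1_from_pr` is a hypothetical helper and not part of the benchmark code:

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    if total == 0:
        return 0.0
    return 2 * precision * recall / total


# Row 0 above (beeYOLO on bumblebee-data): Precision 0.979796, Recall 0.970000
print(round(f1_from_pr(0.979796, 0.970000), 4))
```

For that row the harmonic mean comes out at roughly 0.9749, not the 0.979698 stored in the `F1-Score` column. On the full DataFrame, the same check could be applied with `results_df.apply(lambda r: f1_from_pr(r['Precision'], r['Recall']), axis=1)`.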

## Comparing Model Performance

Next, compare performance by model, by dataset, and by model-dataset combination:

In [11]:
def analyze_model_performance(results_df):
    """Analyze performance across different models and datasets."""
    
    print("="*80)
    print("MODEL PERFORMANCE ANALYSIS")
    print("="*80)
    
    # Overall performance by model
    print("\n1. OVERALL PERFORMANCE BY MODEL:")
    model_performance = results_df.groupby('Model').agg({
        'Precision': ['mean', 'std', 'count'],
        'Recall': ['mean', 'std'],
        'F1-Score': ['mean', 'std'],
        'mAP50': ['mean', 'std'],
        'mAP50-95': ['mean', 'std']
    }).round(4)
    
    print(model_performance)
    
    # Performance by dataset
    print("\n2. PERFORMANCE BY DATASET:")
    dataset_performance = results_df.groupby('Dataset').agg({
        'Precision': ['mean', 'std', 'count'],
        'Recall': ['mean', 'std'],
        'F1-Score': ['mean', 'std'],
        'mAP50': ['mean', 'std'],
        'mAP50-95': ['mean', 'std']
    }).round(4)
    
    print(dataset_performance)
    
    # Model-Dataset combinations
    print("\n3. MODEL-DATASET COMBINATIONS:")
    combo_performance = results_df.groupby(['Model', 'Dataset']).agg({
        'Precision': 'mean',
        'Recall': 'mean',
        'F1-Score': 'mean',
        'mAP50': 'mean',
        'mAP50-95': 'mean'
    }).round(4)
    
    print(combo_performance)
    
    return model_performance, dataset_performance, combo_performance

# Run the analysis
model_perf, dataset_perf, combo_perf = analyze_model_performance(results_df)

MODEL PERFORMANCE ANALYSIS

1. OVERALL PERFORMANCE BY MODEL:
           Precision                Recall         F1-Score           mAP50  \
                mean     std count    mean     std     mean     std    mean   
Model                                                                         
beeYOLO       0.7687  0.2008     4  0.6887  0.2146   0.7352  0.2069  0.7352   
insectYOLO    0.9916  0.0101     4  0.9884  0.0084   0.9892  0.0055  0.9892   

                   mAP50-95          
               std     mean     std  
Model                                
beeYOLO     0.2069   0.4709  0.1990  
insectYOLO  0.0055   0.7937  0.1023  

2. PERFORMANCE BY DATASET:
               Precision                Recall         F1-Score          \
                    mean     std count    mean     std     mean     std   
Dataset                                                                   
bumblebee-data    0.9797  0.0002     2  0.9750  0.0071   0.9807  0.0015   
honeybee-data     0.8808 
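
To see at a glance which model leads on each dataset, the per-combination results can be reshaped and compared directly. The sketch below uses a small stand-in DataFrame built from the rounded mAP50-95 values printed in the validation logs earlier; with the real data, `results_df` would be passed instead. The function name `best_model_per_dataset` is hypothetical.

```python
import pandas as pd


def best_model_per_dataset(df, metric="mAP50-95"):
    """Return per-model scores for each dataset plus the best-scoring model."""
    # pivot_table averages repeated (Dataset, Model) rows instead of raising
    table = df.pivot_table(index="Dataset", columns="Model", values=metric, aggfunc="mean")
    best = table.idxmax(axis=1)
    return table.assign(Best=best)


# Rounded mAP50-95 values taken from the validation logs above
sample = pd.DataFrame({
    "Model": ["insectYOLO"] * 4 + ["beeYOLO"] * 4,
    "Dataset": ["bumblebee-data", "honeybee-data", "ivybee-data", "hoverfly-data"] * 2,
    "mAP50-95": [0.848, 0.842, 0.845, 0.640, 0.769, 0.390, 0.368, 0.357],
})
print(best_model_per_dataset(sample))
```

With these values, insectYOLO leads on all four datasets, and the gap is smallest on bumblebee-data.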

## Visualization: Performance Barplots

Now create grouped barplots for each metric (datasets on the x-axis, one bar per model), plus a heatmap of mAP50-95:

In [15]:
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

def create_performance_barplots(results_df, output_dir="../benchmarks/comparisons"):
    """Create barplots for each performance metric, grouped by model."""
    
    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Set up the plotting style
    plt.style.use('default')
    sns.set_palette("husl")
    
    # Define metrics to plot
    metrics = ['Precision', 'Recall', 'F1-Score', 'mAP50', 'mAP50-95']
    
    # Create a figure with subplots for each metric
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Model Performance Comparison Across Datasets', fontsize=16, fontweight='bold')
    
    # Flatten axes for easier iteration
    axes = axes.flatten()
    
    for i, metric in enumerate(metrics):
        if i < len(axes):
            ax = axes[i]
            
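            # Create pivot table for the current metric; pivot_table averages
            # repeated (Dataset, Model) runs, where pivot() would raise ValueError
            pivot_data = results_df.pivot_table(index='Dataset', columns='Model', values=metric, aggfunc='mean')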
            
            # Create grouped bar plot
            pivot_data.plot(kind='bar', ax=ax, width=0.8)
            
            ax.set_title(f'{metric}', fontweight='bold')
            ax.set_xlabel('Dataset')
            ax.set_ylabel(metric)
            ax.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
            ax.tick_params(axis='x', rotation=45)
            ax.grid(True, alpha=0.3)
            
            # Add value labels on bars
            for container in ax.containers:
                ax.bar_label(container, fmt='%.3f', fontsize=8)
    
    # Remove the extra subplot if we have 5 metrics
    if len(metrics) < 6:
        axes[-1].remove()
    
    plt.tight_layout()
    plt.subplots_adjust(top=0.95)
    
    # Save the combined plot
    combined_plot_path = output_path / "performance_comparison_all_metrics.png"
    plt.savefig(combined_plot_path, dpi=300, bbox_inches='tight')
    print(f"✓ Combined performance plot saved to: {combined_plot_path}")
    
    # Create individual plots for each metric
    for metric in metrics:
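        # Create the figure and axes explicitly; calling DataFrame.plot without
        # ax= would open a second figure and ignore the requested size
        fig_single, ax = plt.subplots(figsize=(12, 8))
        
        # Create pivot table for the current metric (mean over repeated runs)
        pivot_data = results_df.pivot_table(index='Dataset', columns='Model', values=metric, aggfunc='mean')
        
        # Create grouped bar plot on the axes created above
        pivot_data.plot(kind='bar', ax=ax, width=0.8)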
        
        plt.title(f'{metric} Comparison Across Models and Datasets', fontsize=14, fontweight='bold')
        plt.xlabel('Dataset', fontsize=12)
        plt.ylabel(metric, fontsize=12)
        plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.xticks(rotation=45)
        plt.grid(True, alpha=0.3)
        
        # Add value labels on bars
        for container in ax.containers:
            ax.bar_label(container, fmt='%.3f', fontsize=10)
        
        plt.tight_layout()
        
        # Save individual plot
        individual_plot_path = output_path / f"performance_{metric.lower().replace('-', '_')}.png"
        plt.savefig(individual_plot_path, dpi=300, bbox_inches='tight')
        print(f"✓ {metric} plot saved to: {individual_plot_path}")
        
        plt.close()
    
    # Create a summary heatmap
    plt.figure(figsize=(14, 10))
    
    # Create pivot table for mAP50-95 (the strictest metric reported here),
    # averaging repeated (Dataset, Model) runs
    heatmap_data = results_df.pivot_table(index='Dataset', columns='Model', values='mAP50-95', aggfunc='mean')
    
    # Create heatmap
    sns.heatmap(heatmap_data, annot=True, fmt='.3f', cmap='RdYlGn', 
                center=0.5, vmin=0, vmax=1, cbar_kws={'label': 'mAP50-95'})
    
    plt.title('mAP50-95 Performance Heatmap', fontsize=16, fontweight='bold')
    plt.xlabel('Model', fontsize=12)
    plt.ylabel('Dataset', fontsize=12)
    
    # Save heatmap
    heatmap_path = output_path / "performance_heatmap_map50_95.png"
    plt.savefig(heatmap_path, dpi=300, bbox_inches='tight')
    print(f"✓ Performance heatmap saved to: {heatmap_path}")
    
    plt.close()
    
    print(f"\n✓ All visualization files saved to: {output_path}")
    return output_path

# Create the performance barplots
print("\nCreating performance visualizations...")
visualization_dir = create_performance_barplots(results_df)
print(f"✓ Visualizations completed! Check: {visualization_dir}")


Creating performance visualizations...
✓ Combined performance plot saved to: ../benchmarks/comparisons/performance_comparison_all_metrics.png
✓ Precision plot saved to: ../benchmarks/comparisons/performance_precision.png
✓ Recall plot saved to: ../benchmarks/comparisons/performance_recall.png
✓ F1-Score plot saved to: ../benchmarks/comparisons/performance_f1_score.png
✓ mAP50 plot saved to: ../benchmarks/comparisons/performance_map50.png
✓ mAP50-95 plot saved to: ../benchmarks/comparisons/performance_map50_95.png
✓ Performance heatmap saved to: ../benchmarks/comparisons/performance_heatmap_map50_95.png

✓ All visualization files saved to: ../benchmarks/comparisons
✓ Visualizations completed! Check: ../benchmarks/comparisons
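
The plots above are easiest to read when there is one result per (Dataset, Model) pair. Since every `benchmark_results_*.json` file in the folder is loaded, repeated runs can accumulate; one option is to keep only the most recent row per pair before plotting. The sketch below assumes the `Timestamp` column sorts chronologically as a string (the filename format is not shown here, so this is an assumption) and uses the hypothetical helper `keep_latest_runs`.

```python
import pandas as pd


def keep_latest_runs(df, keys=("Model", "Dataset"), order_by="Timestamp"):
    """Keep the last row per key combination after sorting by `order_by`."""
    return (
        df.sort_values(order_by)
          .drop_duplicates(subset=list(keys), keep="last")
          .reset_index(drop=True)
    )


# Placeholder rows for illustration only (not real benchmark results):
# two runs of the same model-dataset pair plus one other model
runs = pd.DataFrame({
    "Model": ["beeYOLO", "beeYOLO", "insectYOLO"],
    "Dataset": ["ivybee-data", "ivybee-data", "ivybee-data"],
    "Timestamp": ["100000", "110000", "100500"],
    "mAP50-95": [0.1, 0.2, 0.3],
})
print(keep_latest_runs(runs))
```

Calling `create_performance_barplots(keep_latest_runs(results_df))` would then plot only the latest run for each pair.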
