# 📈 Notebook 4: Comprehensive Model Evaluation

## Military Object Detection - Performance Analysis

This notebook evaluates the trained YOLOv8 model(s) in depth, reporting detailed metrics and visualizations.

### Objectives:
1. **Model Loading**: Load best trained model(s)
2. **Validation Metrics**: Compute mAP, precision, recall
3. **Per-Class Analysis**: Detailed class-wise performance
4. **Confusion Matrix**: Analyze classification errors
5. **PR Curves**: Precision-Recall analysis
6. **Failure Analysis**: Identify model weaknesses

---

## 1. Setup & Imports

In [6]:
# Standard imports
import os
import sys
from pathlib import Path
import json
import warnings

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
from PIL import Image

# Computer Vision
import cv2

# YAML
import yaml

# Deep Learning
import torch
from ultralytics import YOLO

# Metrics
from sklearn.metrics import confusion_matrix, classification_report

# Progress
from tqdm.notebook import tqdm

# Suppress warnings
warnings.filterwarnings('ignore')

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("✅ All imports successful!")

✅ All imports successful!


In [7]:
# Define paths
PROJECT_ROOT = Path('..')
DATASET_ROOT = PROJECT_ROOT / 'military_object_dataset'
CONFIG_DIR = PROJECT_ROOT / 'config'
MODELS_DIR = PROJECT_ROOT / 'models'
RUNS_DIR = PROJECT_ROOT / 'runs'
RESULTS_DIR = PROJECT_ROOT / 'results'
FIGURES_DIR = PROJECT_ROOT / 'figures'

# Dataset paths
VAL_IMAGES = DATASET_ROOT / 'val' / 'images'
VAL_LABELS = DATASET_ROOT / 'val' / 'labels'

# Create output directories (figures are written to FIGURES_DIR later in the notebook)
RESULTS_DIR.mkdir(exist_ok=True)
FIGURES_DIR.mkdir(exist_ok=True)

# Load configuration
with open(CONFIG_DIR / 'dataset.yaml', 'r') as f:
    dataset_config = yaml.safe_load(f)

CLASS_NAMES = dataset_config['names']
NUM_CLASSES = dataset_config['nc']

print(f"üìã Loaded configuration with {NUM_CLASSES} classes")

üìã Loaded configuration with 12 classes


## 2. Load Trained Model(s)

In [8]:
def find_best_models(runs_dir: Path) -> list:
    """
    Find all trained model checkpoints.
    """
    models = []
    detect_dir = runs_dir / 'detect'
    
    if detect_dir.exists():
        for exp_dir in detect_dir.iterdir():
            if exp_dir.is_dir():
                best_pt = exp_dir / 'weights' / 'best.pt'
                if best_pt.exists():
                    models.append({
                        'name': exp_dir.name,
                        'path': str(best_pt),
                        'results_csv': exp_dir / 'results.csv'
                    })
    
    # Also check models directory
    best_model = MODELS_DIR / 'best_model.pt'
    if best_model.exists():
        models.append({
            'name': 'best_model',
            'path': str(best_model),
            'results_csv': None
        })
    
    return models

# Find available models
available_models = find_best_models(RUNS_DIR)

print("üîç Available Trained Models:")
print("=" * 50)
for i, model in enumerate(available_models):
    print(f"   {i+1}. {model['name']}")
    print(f"      Path: {model['path']}")

🔍 Available Trained Models:
   1. best_model
      Path: ../models/best_model.pt


In [9]:
# Load the best model (or specify which one to evaluate)
# Default: use the first available model

if available_models:
    selected_model = available_models[0]  # Change index to select different model
    model = YOLO(selected_model['path'])
    print(f"✅ Loaded model: {selected_model['name']}")
else:
    print("⚠️ No trained models found! Please run Notebook 03 first.")
    print("   Alternatively, specify a model path manually:")
    print("   model = YOLO('path/to/your/model.pt')")
    
    # For demo purposes, load a pretrained model
    # model = YOLO('yolov8n.pt')  # Uncomment to use pretrained

✅ Loaded model: best_model


## 3. Run Validation

In [10]:
# Run validation on the validation set
print("üîç Running validation...")

val_results = model.val(
    data=str(CONFIG_DIR / 'dataset.yaml'),
    batch=16,
    imgsz=640,
    conf=0.001,  # Low confidence for full curve
    iou=0.6,
    plots=True,
    save_json=True,
    verbose=True
)

🔍 Running validation...
Ultralytics 8.3.237 🚀 Python-3.13.7 torch-2.9.1 CPU (Apple M4)
Model summary (fused): 72 layers, 11,130,228 parameters, 0 gradients, 28.5 GFLOPs
val: Fast image access ✅ (ping: 0.0±0.0 ms, read: 326.0±276.8 MB/s, size: 105.2 KB)
val: Scanning /Users/Hetansh/Github/smart-serve-hackathon-submission/military_object_dataset/val/labels.cache... 2941 images, 273 backgrounds, 0 corrupt: 100% ━━━━━━━━━━━━ 2941/2941 8.2Mit/s 0.0s
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% ━━━━━━━━━━━━ 184/184 3.2s/it 9:53
                   all       2941       5081      0.603       0.41      0.426      0.261
    camouflage_soldier        385        510      0.698      0.633      0.674       0.33
                weapon        222        358      0.611      0.453      0.479      0.319
         military_tank        938       1787      0.758     

In [11]:
# Display overall metrics
print("\nüìä Overall Validation Metrics:")
print("=" * 50)
print(f"   mAP@0.5:      {val_results.box.map50:.4f}")
print(f"   mAP@0.5:0.95: {val_results.box.map:.4f}")
print(f"   Precision:    {val_results.box.mp:.4f}")
print(f"   Recall:       {val_results.box.mr:.4f}")


📊 Overall Validation Metrics:
   mAP@0.5:      0.4258
   mAP@0.5:0.95: 0.2609
   Precision:    0.6028
   Recall:       0.4101


## 4. Per-Class Performance Analysis

In [12]:
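# Extract per-class metrics
per_class_metrics = []

# Ultralytics reports AP only for classes that have instances in the validation
# labels, ordered by `ap_class_index`. Map each array position back to its class ID
# so a class missing from validation cannot shift the values of later classes.
ap_position = {int(c): j for j, c in enumerate(val_results.box.ap_class_index)}

for i in range(NUM_CLASSES):
    j = ap_position.get(i)  # None if class i was not evaluated
    per_class_metrics.append({
        'Class ID': i,
        'Class Name': CLASS_NAMES[i],
        'AP@0.5': float(val_results.box.ap50[j]) if j is not None else 0.0,
        'AP@0.5:0.95': float(val_results.box.ap[j]) if j is not None else 0.0,
    })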

per_class_df = pd.DataFrame(per_class_metrics)
per_class_df = per_class_df.sort_values('AP@0.5', ascending=False).reset_index(drop=True)

print("üìä Per-Class Performance (sorted by AP@0.5):")
print("=" * 60)
display(per_class_df.round(4))

📊 Per-Class Performance (sorted by AP@0.5):


   Class ID          Class Name  AP@0.5  AP@0.5:0.95
0        10   military_aircraft  0.8206       0.5933
1         2       military_tank  0.8159       0.5152
2         0  camouflage_soldier  0.6742       0.3297
3         6             soldier  0.6293       0.3289
4         1              weapon  0.4788       0.319
5         3      military_truck  0.4035       0.2524
6         4    military_vehicle  0.3951       0.2841
7         7    civilian_vehicle  0.2553       0.1342
8         8  military_artillery  0.2109       0.1136
9         5            civilian  0.0          0.0


In [13]:
# Visualize per-class AP
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# AP@0.5 bar chart
colors = plt.cm.RdYlGn(per_class_df['AP@0.5'].values)

bars1 = axes[0].barh(per_class_df['Class Name'], per_class_df['AP@0.5'], color=colors)
axes[0].set_xlabel('AP@0.5', fontsize=12)
axes[0].set_ylabel('Class', fontsize=12)
axes[0].set_title('Per-Class AP@0.5', fontsize=14, fontweight='bold')
axes[0].set_xlim([0, 1])
axes[0].axvline(0.5, color='gray', linestyle='--', alpha=0.5, label='Threshold (0.5)')
axes[0].legend(loc='lower right')  # Show the threshold label defined above

# Add value labels
for bar, val in zip(bars1, per_class_df['AP@0.5']):
    axes[0].text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
                 f'{val:.3f}', va='center', fontsize=9)

# AP@0.5:0.95 bar chart
colors2 = plt.cm.RdYlGn(per_class_df['AP@0.5:0.95'].values)

bars2 = axes[1].barh(per_class_df['Class Name'], per_class_df['AP@0.5:0.95'], color=colors2)
axes[1].set_xlabel('AP@0.5:0.95', fontsize=12)
axes[1].set_ylabel('Class', fontsize=12)
axes[1].set_title('Per-Class AP@0.5:0.95', fontsize=14, fontweight='bold')
axes[1].set_xlim([0, 1])

# Add value labels
for bar, val in zip(bars2, per_class_df['AP@0.5:0.95']):
    axes[1].text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
                 f'{val:.3f}', va='center', fontsize=9)

plt.tight_layout()
plt.savefig('../figures/12_per_class_ap.png', dpi=150, bbox_inches='tight')
plt.show()

print("üíæ Figure saved to: figures/12_per_class_ap.png")

<Figure size 1600x600 with 2 Axes>

💾 Figure saved to: figures/12_per_class_ap.png


In [14]:
# Identify best and worst performing classes
print("\nüéØ Performance Summary:")
print("=" * 50)

# Best classes (AP@0.5 > 0.7)
best_classes = per_class_df[per_class_df['AP@0.5'] > 0.7]
print(f"\n‚úÖ Well-Performing Classes (AP@0.5 > 0.7): {len(best_classes)}")
for _, row in best_classes.iterrows():
    print(f"   ‚Ä¢ {row['Class Name']}: {row['AP@0.5']:.4f}")

# Poor classes (AP@0.5 < 0.3)
poor_classes = per_class_df[per_class_df['AP@0.5'] < 0.3]
print(f"\n‚ö†Ô∏è Challenging Classes (AP@0.5 < 0.3): {len(poor_classes)}")
for _, row in poor_classes.iterrows():
    print(f"   ‚Ä¢ {row['Class Name']}: {row['AP@0.5']:.4f}")


🎯 Performance Summary:

✅ Well-Performing Classes (AP@0.5 > 0.7): 2
   • military_aircraft: 0.8206
   • military_tank: 0.8159

⚠️ Challenging Classes (AP@0.5 < 0.3): 5
   • civilian_vehicle: 0.2553
   • military_artillery: 0.2109
   • civilian: 0.0000
   • trench: 0.0000
   • military_warship: 0.0000


## 5. Confusion Matrix Analysis

In [15]:
def collect_predictions(model, images_dir: Path, labels_dir: Path, conf_threshold: float = 0.25) -> tuple:
    """
    Collect ground truth and predictions for confusion matrix.
    """
    all_gt = []
    all_pred = []
    
    image_files = list(images_dir.glob('*.jpg'))
    
    for img_path in tqdm(image_files[:500], desc="Collecting predictions"):  # Sample for speed
        # Get ground truth
        label_path = labels_dir / f"{img_path.stem}.txt"
        gt_classes = []
        
        if label_path.exists():
            with open(label_path, 'r') as f:
                for line in f:
                    if line.strip():
                        gt_classes.append(int(line.strip().split()[0]))
        
        # Get predictions
        results = model.predict(str(img_path), conf=conf_threshold, verbose=False)
        pred_classes = results[0].boxes.cls.cpu().numpy().astype(int).tolist() if len(results[0].boxes) > 0 else []
        
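        # Match predictions to ground truth by class label only (no IoU check).
        # A matched pair therefore always shares a class, so this pairing can
        # record hits, misses, and false positives but never a cross-class confusion.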
        for gt_cls in gt_classes:
            if gt_cls in pred_classes:
                all_gt.append(gt_cls)
                all_pred.append(gt_cls)  # Correct prediction
                pred_classes.remove(gt_cls)  # Remove matched
            else:
                all_gt.append(gt_cls)
                all_pred.append(-1)  # Missed (background)
        
        # False positives
        for pred_cls in pred_classes:
            all_gt.append(-1)  # Background
            all_pred.append(pred_cls)
    
    return np.array(all_gt), np.array(all_pred)
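
Because `collect_predictions` pairs boxes by class label alone, every matched pair lands on the diagonal of the confusion matrix. A minimal sketch of location-aware matching is shown below. It assumes boxes are available as `(x1, y1, x2, y2)` in the same coordinate system for both ground truth and predictions (YOLO label files store normalized center/width/height, so they would need converting first). The helper names `iou_xyxy` and `match_by_iou` are hypothetical and not part of the notebook or of Ultralytics.

```python
def iou_xyxy(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_by_iou(gt, preds, iou_thr=0.5, background=-1):
    """Greedy one-to-one matching of predictions to ground truth by IoU.

    gt and preds are lists of (class_id, box). Returns (gt_class, pred_class)
    pairs; unmatched ground truth is paired with `background` (a miss) and
    unmatched predictions get `background` as ground truth (a false positive).
    """
    # Score every ground-truth/prediction pair, best overlap first
    candidates = sorted(
        ((iou_xyxy(g_box, p_box), g, p)
         for g, (_, g_box) in enumerate(gt)
         for p, (_, p_box) in enumerate(preds)),
        reverse=True,
    )
    used_gt, used_pred, pairs = set(), set(), []
    for iou, g, p in candidates:
        if iou < iou_thr:
            break  # remaining candidates overlap even less
        if g in used_gt or p in used_pred:
            continue  # each box may be matched only once
        used_gt.add(g)
        used_pred.add(p)
        pairs.append((gt[g][0], preds[p][0]))  # classes may differ: a confusion
    pairs += [(c, background) for g, (c, _) in enumerate(gt) if g not in used_gt]
    pairs += [(background, c) for p, (c, _) in enumerate(preds) if p not in used_pred]
    return pairs


# Toy example: one overlapping pair with different classes, one miss, one false positive
gt = [(2, (0, 0, 10, 10)), (3, (20, 20, 30, 30))]
preds = [(4, (1, 1, 10, 10)), (2, (50, 50, 60, 60))]
pairs = match_by_iou(gt, preds)
print(pairs)
```

With this pairing, a ground-truth box of class 2 that is covered by a class-4 prediction is recorded as a 2-to-4 confusion rather than as one miss plus one false positive.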

In [16]:
# Collect predictions for confusion matrix
print("üîç Collecting predictions for confusion matrix...")
gt_labels, pred_labels = collect_predictions(model, VAL_IMAGES, VAL_LABELS, conf_threshold=0.25)

🔍 Collecting predictions for confusion matrix...


Collecting predictions:   0%|          | 0/500 [00:00<?, ?it/s]

In [17]:
# Create confusion matrix (only for valid classes, excluding background)
mask = (gt_labels >= 0) & (pred_labels >= 0)
gt_valid = gt_labels[mask]
pred_valid = pred_labels[mask]

if len(gt_valid) > 0:
    # Compute confusion matrix
    cm = confusion_matrix(gt_valid, pred_valid, labels=list(range(NUM_CLASSES)))
    
    # Normalize by row (recall normalization)
    cm_normalized = cm.astype('float') / (cm.sum(axis=1, keepdims=True) + 1e-10)
    
    # Plot confusion matrix
    fig, axes = plt.subplots(1, 2, figsize=(20, 8))
    
    # Raw counts
    sns.heatmap(
        cm, 
        annot=True, 
        fmt='d', 
        cmap='Blues',
        xticklabels=[CLASS_NAMES[i] for i in range(NUM_CLASSES)],
        yticklabels=[CLASS_NAMES[i] for i in range(NUM_CLASSES)],
        ax=axes[0]
    )
    axes[0].set_xlabel('Predicted', fontsize=12)
    axes[0].set_ylabel('Actual', fontsize=12)
    axes[0].set_title('Confusion Matrix (Counts)', fontsize=14, fontweight='bold')
    axes[0].tick_params(axis='x', rotation=45)
    axes[0].tick_params(axis='y', rotation=0)
    
    # Normalized
    sns.heatmap(
        cm_normalized, 
        annot=True, 
        fmt='.2f', 
        cmap='YlOrRd',
        xticklabels=[CLASS_NAMES[i] for i in range(NUM_CLASSES)],
        yticklabels=[CLASS_NAMES[i] for i in range(NUM_CLASSES)],
        ax=axes[1],
        vmin=0, vmax=1
    )
    axes[1].set_xlabel('Predicted', fontsize=12)
    axes[1].set_ylabel('Actual', fontsize=12)
    axes[1].set_title('Confusion Matrix (Normalized)', fontsize=14, fontweight='bold')
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].tick_params(axis='y', rotation=0)
    
    plt.tight_layout()
    plt.savefig('../figures/13_confusion_matrix.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("üíæ Figure saved to: figures/13_confusion_matrix.png")
else:
    print("‚ö†Ô∏è Not enough valid predictions for confusion matrix")

<Figure size 2000x800 with 4 Axes>

💾 Figure saved to: figures/13_confusion_matrix.png


In [19]:
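# List off-diagonal confusion-matrix entries (actual class != predicted class).
# Assumption: `confusions` is not defined in any cell shown here; the 'Actual' and 'Predicted' keys are illustrative.
confusions = [
    {'Actual': CLASS_NAMES[i], 'Predicted': CLASS_NAMES[j], 'Count': int(cm[i, j])}
    for i in range(NUM_CLASSES) for j in range(NUM_CLASSES)
    if i != j and cm[i, j] > 0
]
confusions_df = pd.DataFrame(confusions)

if len(confusions_df) > 0:
    confusions_df = confusions_df.sort_values('Count', ascending=False).head(10)
    print("\nTop 10 Confusions:")
    display(confusions_df)
else:
    print("\n✅ No confusions detected!")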


✅ No confusions detected!


## 6. Precision-Recall Curves

In [20]:
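# Load PR curve data if available from validation.
# Depending on the Ultralytics version, the file may be saved as 'PR_curve.png'
# or with a 'Box' prefix ('BoxPR_curve.png'), so try both names.
val_dir = Path(val_results.save_dir)
pr_candidates = [val_dir / name for name in ('PR_curve.png', 'BoxPR_curve.png')]
pr_curve_path = next((p for p in pr_candidates if p.exists()), pr_candidates[0])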

if pr_curve_path.exists():
    # Display the PR curve generated by YOLO
    img = Image.open(pr_curve_path)
    plt.figure(figsize=(12, 10))
    plt.imshow(img)
    plt.axis('off')
    plt.title('Precision-Recall Curves', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('../figures/14_pr_curves.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("üíæ Figure saved to: figures/14_pr_curves.png")
else:
    print("‚ö†Ô∏è PR curve not found. It should be generated during validation.")

⚠️ PR curve not found. It should be generated during validation.


In [21]:
# Display other validation plots if available
plot_files = [
    ('F1_curve.png', 'F1 Score Curve'),
    ('P_curve.png', 'Precision Curve'),
    ('R_curve.png', 'Recall Curve'),
]

available_plots = []
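for filename, title in plot_files:
    # Also try the 'Box'-prefixed name that some Ultralytics versions use
    for candidate in (filename, f'Box{filename}'):
        plot_path = Path(val_results.save_dir) / candidate
        if plot_path.exists():
            available_plots.append((plot_path, title))
            break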

if available_plots:
    fig, axes = plt.subplots(1, len(available_plots), figsize=(6*len(available_plots), 5))
    if len(available_plots) == 1:
        axes = [axes]
    
    for ax, (path, title) in zip(axes, available_plots):
        img = Image.open(path)
        ax.imshow(img)
        ax.axis('off')
        ax.set_title(title, fontsize=12, fontweight='bold')
    
    plt.tight_layout()
    plt.savefig('../figures/15_metric_curves.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("üíæ Figure saved to: figures/15_metric_curves.png")

## 7. Detection Visualization

In [22]:
def visualize_detection(model, image_path: Path, conf_threshold: float = 0.25, ax=None):
    """
    Visualize model detection on an image.
    """
    # Run prediction
    results = model.predict(str(image_path), conf=conf_threshold, verbose=False)
    
    # Get annotated image
    annotated = results[0].plot()
    annotated = cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB)
    
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    
    ax.imshow(annotated)
    ax.axis('off')
    
    # Get detection info
    num_detections = len(results[0].boxes)
    ax.set_title(f"{image_path.name} ({num_detections} detections)", fontsize=10)
    
    return results[0]

In [23]:
# Visualize sample detections
np.random.seed(42)
sample_images = np.random.choice(list(VAL_IMAGES.glob('*.jpg')), size=9, replace=False)

fig, axes = plt.subplots(3, 3, figsize=(18, 18))
axes = axes.flatten()

for idx, img_path in enumerate(sample_images):
    visualize_detection(model, img_path, conf_threshold=0.25, ax=axes[idx])

plt.suptitle('Sample Detections on Validation Set', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('../figures/16_sample_detections.png', dpi=150, bbox_inches='tight')
plt.show()

print("üíæ Figure saved to: figures/16_sample_detections.png")

<Figure size 1800x1800 with 9 Axes>

💾 Figure saved to: figures/16_sample_detections.png


## 8. Confidence Distribution Analysis

In [24]:
def collect_confidence_scores(model, images_dir: Path, n_samples: int = 200) -> dict:
    """
    Collect confidence scores for all detections.
    """
    confidence_by_class = {i: [] for i in range(NUM_CLASSES)}
    
    image_files = list(images_dir.glob('*.jpg'))[:n_samples]
    
    for img_path in tqdm(image_files, desc="Collecting confidence scores"):
        results = model.predict(str(img_path), conf=0.01, verbose=False)
        
        if len(results[0].boxes) > 0:
            classes = results[0].boxes.cls.cpu().numpy().astype(int)
            confs = results[0].boxes.conf.cpu().numpy()
            
            for cls, conf in zip(classes, confs):
                confidence_by_class[cls].append(conf)
    
    return confidence_by_class
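
The per-class score lists returned by `collect_confidence_scores` can also be condensed into a few numbers per class, which is easier to compare than twelve histograms. The sketch below is an illustration only: `summarize_confidences` is a hypothetical helper, and the demo input is made-up data rather than output from the model.

```python
from statistics import mean, median


def summarize_confidences(confidence_by_class, class_names, threshold=0.5):
    """Summarize detection confidences per class.

    confidence_by_class maps class ID -> list of scores (same shape as the
    dict built by collect_confidence_scores); class_names maps ID -> name.
    Classes without detections get None for the statistics.
    """
    rows = []
    for class_id, scores in confidence_by_class.items():
        scores = [float(s) for s in scores]  # also accepts NumPy scalars
        n = len(scores)
        rows.append({
            'class': class_names[class_id],
            'n': n,
            'mean': mean(scores) if n else None,
            'median': median(scores) if n else None,
            # Share of detections that would survive the given threshold
            'frac_above': sum(s >= threshold for s in scores) / n if n else None,
        })
    return rows


# Made-up scores for two classes, one of them with no detections
demo = summarize_confidences({0: [0.9, 0.6, 0.2], 1: []}, {0: 'soldier', 1: 'trench'})
for row in demo:
    print(row)
```

In the notebook this could be called as `summarize_confidences(confidence_scores, CLASS_NAMES)`. A class whose `frac_above` is low at the chosen threshold is one where most detections would be discarded at inference time.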

In [25]:
# Collect confidence scores
print("üîç Analyzing confidence distributions...")
confidence_scores = collect_confidence_scores(model, VAL_IMAGES, n_samples=200)

🔍 Analyzing confidence distributions...


Collecting confidence scores:   0%|          | 0/200 [00:00<?, ?it/s]

In [26]:
# Visualize confidence distributions
fig, axes = plt.subplots(3, 4, figsize=(20, 12))
axes = axes.flatten()

for class_id in range(NUM_CLASSES):
    ax = axes[class_id]
    scores = confidence_scores[class_id]
    
    if len(scores) > 0:
        ax.hist(scores, bins=20, color='#3498db', alpha=0.7, edgecolor='black')
        ax.axvline(np.mean(scores), color='red', linestyle='--', 
                   label=f'Mean: {np.mean(scores):.2f}')
        ax.axvline(0.5, color='green', linestyle=':', alpha=0.7,
                   label='Threshold: 0.5')
        ax.legend(fontsize=8)
    else:
        ax.text(0.5, 0.5, 'No detections', ha='center', va='center')
    
    ax.set_xlabel('Confidence')
    ax.set_ylabel('Count')
    ax.set_title(f"{CLASS_NAMES[class_id]}\n(n={len(scores)})", fontsize=10)
    ax.set_xlim([0, 1])

plt.suptitle('Confidence Score Distribution by Class', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('../figures/17_confidence_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

print("üíæ Figure saved to: figures/17_confidence_distributions.png")

<Figure size 2000x1200 with 12 Axes>

💾 Figure saved to: figures/17_confidence_distributions.png


## 9. Summary Report

In [27]:
# Generate summary report
print("=" * 70)
print("üìä EVALUATION SUMMARY REPORT")
print("=" * 70)

print(f"\nüîß MODEL: {selected_model['name'] if available_models else 'N/A'}")

print("\nüìà OVERALL METRICS:")
print(f"   mAP@0.5:      {val_results.box.map50:.4f}")
print(f"   mAP@0.5:0.95: {val_results.box.map:.4f}")
print(f"   Precision:    {val_results.box.mp:.4f}")
print(f"   Recall:       {val_results.box.mr:.4f}")

print("\nüèÜ BEST PERFORMING CLASSES (AP@0.5):")
for _, row in per_class_df.head(3).iterrows():
    print(f"   ‚Ä¢ {row['Class Name']}: {row['AP@0.5']:.4f}")

print("\n‚ö†Ô∏è WORST PERFORMING CLASSES (AP@0.5):")
for _, row in per_class_df.tail(3).iterrows():
    print(f"   ‚Ä¢ {row['Class Name']}: {row['AP@0.5']:.4f}")

print("\nüí° OBSERVATIONS:")
avg_ap = per_class_df['AP@0.5'].mean()
print(f"   ‚Ä¢ Average AP@0.5 across classes: {avg_ap:.4f}")
print(f"   ‚Ä¢ Class with highest AP: {per_class_df.iloc[0]['Class Name']}")
print(f"   ‚Ä¢ Class with lowest AP: {per_class_df.iloc[-1]['Class Name']}")

# Identify minority class performance
minority_classes = ['trench', 'civilian']  # Known minority classes
print(f"\nüîç MINORITY CLASS PERFORMANCE:")
for cls in minority_classes:
    row = per_class_df[per_class_df['Class Name'] == cls]
    if not row.empty:
        print(f"   ‚Ä¢ {cls}: AP@0.5 = {row.iloc[0]['AP@0.5']:.4f}")

print("\n" + "=" * 70)

📊 EVALUATION SUMMARY REPORT

🔧 MODEL: best_model

📈 OVERALL METRICS:
   mAP@0.5:      0.4258
   mAP@0.5:0.95: 0.2609
   Precision:    0.6028
   Recall:       0.4101

🏆 BEST PERFORMING CLASSES (AP@0.5):
   • military_aircraft: 0.8206
   • military_tank: 0.8159
   • camouflage_soldier: 0.6742

⚠️ WORST PERFORMING CLASSES (AP@0.5):
   • civilian: 0.0000
   • trench: 0.0000
   • military_warship: 0.0000

💡 OBSERVATIONS:
   • Average AP@0.5 across classes: 0.3903
   • Class with highest AP: military_aircraft
   • Class with lowest AP: military_warship

🔍 MINORITY CLASS PERFORMANCE:
   • trench: AP@0.5 = 0.0000
   • civilian: AP@0.5 = 0.0000
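
The average AP@0.5 printed under OBSERVATIONS (0.3903) is lower than the mAP@0.5 reported by validation (0.4258). The arithmetic below, using only the per-class values printed in this notebook, shows that the two figures agree if validation averaged over 11 classes while the table averages over all 12. A plausible reading, which the outputs here do not confirm, is that one class has no instances in the validation set and was padded with 0 in the table. The code does not identify which class that would be.

```python
# AP@0.5 per class, copied from the outputs above
ap50 = {
    'military_aircraft': 0.8206,
    'military_tank': 0.8159,
    'camouflage_soldier': 0.6742,
    'soldier': 0.6293,
    'weapon': 0.4788,
    'military_truck': 0.4035,
    'military_vehicle': 0.3951,
    'civilian_vehicle': 0.2553,
    'military_artillery': 0.2109,
    'civilian': 0.0,
    'trench': 0.0,
    'military_warship': 0.0,
}

total = sum(ap50.values())
mean_over_table = total / len(ap50)       # all 12 rows of the table
mean_over_eleven = total / (len(ap50) - 1)  # assumes one row is padding

print(f"Mean over 12 classes: {mean_over_table:.4f}")
print(f"Mean over 11 classes: {mean_over_eleven:.4f}")
```

If that reading is correct, the 12-class average understates performance on the classes that were actually evaluated, and the zero for the padded class says nothing about how well the model detects it.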



In [28]:
# Save evaluation results
eval_results = {
    'model_name': selected_model['name'] if available_models else 'unknown',
    'overall_metrics': {
        'mAP50': float(val_results.box.map50),
        'mAP50-95': float(val_results.box.map),
        'precision': float(val_results.box.mp),
        'recall': float(val_results.box.mr)
    },
    'per_class_metrics': per_class_df.to_dict(orient='records')
}

with open(RESULTS_DIR / 'evaluation_results.json', 'w') as f:
    json.dump(eval_results, f, indent=2)

# Save per-class metrics as CSV
per_class_df.to_csv(RESULTS_DIR / 'per_class_metrics.csv', index=False)

print(f"üíæ Evaluation results saved to:")
print(f"   ‚Ä¢ {RESULTS_DIR / 'evaluation_results.json'}")
print(f"   ‚Ä¢ {RESULTS_DIR / 'per_class_metrics.csv'}")

💾 Evaluation results saved to:
   • ../results/evaluation_results.json
   • ../results/per_class_metrics.csv


In [29]:
print("\n✅ Evaluation Complete! Proceed to Notebook 05: Inference")


✅ Evaluation Complete! Proceed to Notebook 05: Inference
