# Downstream Evaluation: Species Classification with SpeciesNet

Wildlife species classification from camera trap images is a useful downstream task for evaluating image restoration models in ecological applications. It works well as a testbed for restoration quality because:

1. **Ecological Impact**: Accurate species classification directly supports biodiversity monitoring, conservation efforts, and wildlife research
2. **Challenging Conditions**: Camera trap images often suffer from motion blur, low light, weather artifacts, and partial occlusions, making restoration particularly valuable
3. **Fine-grained Recognition**: Distinguishing between closely related species requires preservation of subtle visual features that could be lost during restoration
4. **Real-world Variability**: Images span diverse environments, lighting conditions, and animal poses, providing comprehensive restoration evaluation

This evaluation framework uses Google's SpeciesNet ensemble, combining MegaDetector for animal detection with a specialized species classifier, to assess how well restored camera trap images maintain the critical visual information needed for accurate wildlife identification.

## Library Imports and Environment Setup

This section imports all necessary libraries for the SpeciesNet evaluation pipeline. The imports encompass data processing (pandas, numpy), visualization (matplotlib, seaborn), machine learning evaluation (scikit-learn), and specialized tools for handling taxonomic data and camera trap imagery.

**Key Library Categories:**
- **Data Processing**: pandas for structured data handling, numpy for numerical computations
- **Visualization**: matplotlib and seaborn for comprehensive plotting and analysis visualization
- **Evaluation Metrics**: scikit-learn for standard classification metrics and confusion matrices
- **Image Processing**: PIL for image loading and basic processing operations
- **Utility Libraries**: pathlib for file system operations, collections for data structures

The setup cell also suppresses Python warnings, applies a default plotting style, and prints a confirmation message once every import has succeeded.

In [None]:
# Import Required Libraries
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from PIL import Image
from typing import Dict, List, Tuple, Optional, Union, Any
from collections import defaultdict, Counter
import warnings
warnings.filterwarnings('ignore')

# Sklearn imports for evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support, 
    classification_report, confusion_matrix,
    top_k_accuracy_score
)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("All libraries imported successfully")
print("Ready for species classification evaluation with SpeciesNet")

## Configuration and Setup

This section establishes the configuration parameters for the SpeciesNet evaluation pipeline. The configuration encompasses file paths, model settings, and evaluation parameters that control the entire downstream evaluation process.

**Key Configuration Components:**
- **File Paths**: Locations of camera trap images, ground truth annotations, and output files
- **SpeciesNet Parameters**: Model version, geographic constraints, and processing options
- **Evaluation Settings**: Metrics to compute, taxonomic levels to analyze, and visualization parameters

The configuration is designed to be easily adaptable to different datasets and evaluation scenarios while maintaining consistency with the iWildCam dataset format and SpeciesNet requirements.
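
Before running the pipeline it can help to confirm that the configured input paths exist. The following is a minimal sketch and not part of SpeciesNet: `check_config_paths` is a hypothetical helper, and it assumes a configuration object that exposes the `images_folder` and `ground_truth_csv` attributes used in this notebook.

```python
from pathlib import Path
from types import SimpleNamespace


def check_config_paths(cfg) -> dict:
    """Return {attribute_name: exists?} for the input paths on a config object.

    Hypothetical helper: it only inspects `images_folder` and
    `ground_truth_csv`, because `predictions_json` is an output that
    does not exist until SpeciesNet has been run.
    """
    return {
        "images_folder": Path(cfg.images_folder).is_dir(),
        "ground_truth_csv": Path(cfg.ground_truth_csv).is_file(),
    }


# Stand-in for the notebook's Config object, with deliberately missing paths
example_cfg = SimpleNamespace(
    images_folder="path/that/does/not/exist",
    ground_truth_csv="path/that/does/not/exist.csv",
)
for name, ok in check_config_paths(example_cfg).items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Passing the notebook's own `config` object instead of `example_cfg` should work the same way, since only attribute access is used.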

In [None]:
# Configuration
class Config:
    def __init__(self):
        # Paths - Update these with your actual paths
        self.images_folder = "Wildlife_Classification/nighttime_low_confidence"  # Folder containing camera trap images
        self.ground_truth_csv = "Wildlife_Classification/ground_truth.csv"  # CSV file with ground truth labels
        self.predictions_json = "Wildlife_Classification/predictions.json"  # Output file for SpeciesNet predictions
        
        # SpeciesNet Configuration
        self.country_code = None  # Optional: 3-letter ISO country code (e.g., "USA", "CAN", "BRA")
        self.admin1_region = None  # Optional: State code for US (e.g., "CA", "TX")
        self.model_version = "kaggle:google/speciesnet/pyTorch/v4.0.1a"  # SpeciesNet model version
        
        # Evaluation Parameters
        self.top_k_accuracy = [1, 3, 5]  # Top-k accuracy metrics to compute
        self.taxonomic_levels = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']
        
        # Visualization Parameters
        self.max_samples_to_show = 10  # Number of sample images to display
        self.figure_size = (12, 8)
        self.confusion_matrix_size = (15, 12)

# Initialize configuration
config = Config()

print("Configuration initialized!")
print(f"Update the paths in the Config class above:")
print(f"- images_folder: {config.images_folder}")
print(f"- ground_truth_csv: {config.ground_truth_csv}")
print(f"- predictions_json: {config.predictions_json}")

## Installation and Dependencies

### SpeciesNet Package Installation

Install the SpeciesNet Python package for species classification:

```bash
pip install speciesnet
```

**Note for Mac users:** If you encounter installation errors, use:
```bash
pip install speciesnet --use-pep517
```

**Verify Installation:**
```bash
python -m speciesnet.scripts.run_model --help
```

### Required Files and Dependencies

**Model Components:**
- **MegaDetector v5**: Automatically downloaded with SpeciesNet package
- **SpeciesNet Classifier**: EfficientNet V2 M model for species classification
- **Taxonomy Database**: Built-in taxonomic hierarchy for 2000+ species

**Dataset Requirements:**
- Camera trap images in standard formats (JPEG, PNG)
- Ground truth annotations with species labels
- Optional: Geographic metadata (country code, region) for improved accuracy

**Key Citations:**
- Beery, S., et al. (2021). "The iWildCam 2021 Competition Dataset." NeurIPS Datasets and Benchmarks Track.
- Willi, M., et al. (2019). "Identifying animal species in camera trap images using deep learning and citizen science." Methods in Ecology and Evolution.

## SpeciesNet Prediction Pipeline

This section implements the core SpeciesNet prediction pipeline for processing camera trap images. SpeciesNet is a two-stage ensemble system that combines animal detection with species classification to provide robust wildlife identification.

**Pipeline Architecture:**
- **Stage 1 - MegaDetector**: Detects and localizes animals, people, and vehicles in camera trap images
- **Stage 2 - Species Classifier**: Classifies detected animals into 2000+ species using an EfficientNet V2 model
- **Geographic Integration**: Optional geographic priors to improve classification accuracy based on species distribution data
- **Ensemble Logic**: Combines detection confidence, classification scores, and geographic likelihood for final predictions

**Key Features:**
- **Automated Model Download**: Automatically retrieves pre-trained model weights on first use
- **Batch Processing**: Efficient processing of large image datasets with progress tracking
- **Error Handling**: Robust handling of corrupted images, failed detections, and processing errors
- **Flexible Output**: Structured JSON output with detailed prediction information and confidence scores

The pipeline is designed to handle the challenging conditions typical of camera trap imagery, including poor lighting, motion blur, partial occlusions, and diverse environmental contexts.
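
The evaluation code in this notebook reads only a handful of keys from the predictions file: a top-level `predictions` list whose entries carry `filepath`, `prediction`, `prediction_score`, and, for failed images, `failures`. The sketch below illustrates that shape with a hand-written sample. All values, including the labels and the failure name, are invented placeholders and do not show real SpeciesNet output.

```python
import json
from collections import Counter

# Hand-written sample using only the keys this notebook reads from the
# predictions file; every value here is made up for illustration.
sample_json = """
{
  "predictions": [
    {"filepath": "images/a.jpg", "prediction": "label_a", "prediction_score": 0.91},
    {"filepath": "images/b.jpg", "prediction": "label_b", "prediction_score": 0.40},
    {"filepath": "images/c.jpg", "failures": ["EXAMPLE_FAILURE"]}
  ]
}
"""

records = json.loads(sample_json).get("predictions", [])

# An entry is treated as failed when it carries a non-empty "failures" list
failed = [r for r in records if r.get("failures")]
succeeded = [r for r in records if not r.get("failures")]
label_counts = Counter(r.get("prediction", "unknown") for r in succeeded)

print(f"total={len(records)} succeeded={len(succeeded)} failed={len(failed)}")
print(label_counts.most_common(2))
```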

In [None]:
class SpeciesNetPredictor:
    """
    Wrapper class for running SpeciesNet predictions on camera trap images.
    """
    
    def __init__(self, config: Config):
        self.config = config
        self.predictions = None
        
    def check_speciesnet_installation(self):
        """Check if SpeciesNet is installed and can be imported."""
        try:
            import speciesnet
            print("SpeciesNet is installed and available")
            return True
        except ImportError:
            print("SpeciesNet is not installed.")
            print("Please install it using: pip install speciesnet")
            return False
    
    def prepare_image_list(self):
        """
        Prepare a list of images in the format expected by SpeciesNet.
        Returns a list of image file paths.
        """
        images_path = Path(self.config.images_folder)
        if not images_path.exists():
            raise FileNotFoundError(f"Images folder not found: {self.config.images_folder}")
        
        # Find all image files, matching extensions case-insensitively so that
        # each file is listed exactly once regardless of filesystem
        image_extensions = {'.jpg', '.jpeg', '.png', '.tiff', '.tif'}
        image_files = sorted(
            p for p in images_path.iterdir()
            if p.is_file() and p.suffix.lower() in image_extensions
        )
        
        print(f"Found {len(image_files)} images in {self.config.images_folder}")
        return [str(img) for img in image_files]
    
    def run_speciesnet_command(self):
        """
        Generate and display the SpeciesNet command that should be run.
        """
        cmd_parts = [
            "python -m speciesnet.scripts.run_model",
            f'--folders "{self.config.images_folder}"',
            f'--predictions_json "{self.config.predictions_json}"'
        ]
        
        if self.config.country_code:
            cmd_parts.append(f"--country {self.config.country_code}")
        
        if self.config.admin1_region:
            cmd_parts.append(f"--admin1_region {self.config.admin1_region}")
        
        if self.config.model_version != "kaggle:google/speciesnet/pyTorch/v4.0.1a":
            cmd_parts.append(f"--model {self.config.model_version}")
        
        command = " ".join(cmd_parts)
        
        print("Run the following command in your terminal to generate SpeciesNet predictions:")
        print("=" * 80)
        print(command)
        print("=" * 80)
        
        return command
    
    def load_predictions(self):
        """
        Load SpeciesNet predictions from the JSON file.
        """
        predictions_path = Path(self.config.predictions_json)
        if not predictions_path.exists():
            raise FileNotFoundError(f"Predictions file not found: {self.config.predictions_json}")
        
        with open(predictions_path, 'r') as f:
            data = json.load(f)
        
        self.predictions = data.get('predictions', [])
        print(f"Loaded {len(self.predictions)} predictions from {self.config.predictions_json}")
        
        return self.predictions
    
    def get_prediction_summary(self):
        """
        Get a summary of the predictions.
        """
        if not self.predictions:
            print("No predictions loaded. Run load_predictions() first.")
            return
        
        # Count successful vs failed predictions
        successful = [p for p in self.predictions if 'failures' not in p]
        failed = [p for p in self.predictions if 'failures' in p]
        
        print(f"Prediction Summary:")
        print(f"- Total images: {len(self.predictions)}")
        print(f"- Successful predictions: {len(successful)}")
        print(f"- Failed predictions: {len(failed)}")
        
        if failed:
            failure_types = Counter()
            for p in failed:
                for failure in p.get('failures', []):
                    failure_types[failure] += 1
            
            print("Failure types:")
            for failure_type, count in failure_types.items():
                print(f"- {failure_type}: {count}")
        
        # Count prediction types
        if successful:
            prediction_counts = Counter()
            for p in successful:
                pred = p.get('prediction', 'unknown')
                prediction_counts[pred] += 1
            
            print(f"Top 10 most common predictions:")
            for pred, count in prediction_counts.most_common(10):
                print(f"- {pred}: {count}")

# Initialize the predictor
predictor = SpeciesNetPredictor(config)

# Check if SpeciesNet is installed
if predictor.check_speciesnet_installation():
    print("You can proceed with generating predictions.")
else:
    print("Please install SpeciesNet first.")

In [None]:
# Generate SpeciesNet Command
print("Step 1: Generate SpeciesNet predictions")
print("="*50)

# Display the command to run
command = predictor.run_speciesnet_command()

print("After running the command in terminal, come back to this notebook and continue with the evaluation.")
print("Note: SpeciesNet will automatically download model weights on first run.")
print("The prediction process may take several minutes depending on the number of images.")

## Ground Truth Processing and Data Alignment

This section handles the complex task of processing ground truth annotations and aligning them with SpeciesNet predictions. The evaluation framework supports the iWildCam dataset format and provides robust taxonomic parsing capabilities.

**Ground Truth Processing:**
- **Taxonomic Parsing**: Extracts hierarchical taxonomic information from structured labels (Kingdom > Phylum > Class > Order > Family > Genus > Species)
- **Label Standardization**: Handles various label formats and missing taxonomic levels
- **Data Validation**: Ensures consistency and completeness of taxonomic annotations

**Data Alignment Process:**
- **Filename Matching**: Aligns predictions with ground truth based on image filenames
- **Missing Data Handling**: Identifies and reports images with missing predictions or ground truth
- **Error Tracking**: Maintains detailed records of failed predictions and their causes

**Taxonomic Hierarchy Support:**
The framework maintains full taxonomic hierarchy for comprehensive evaluation at multiple levels of biological classification. This enables evaluation of both fine-grained species identification and broader taxonomic group classification, which is crucial for understanding model performance across different levels of biological similarity.
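
As a small illustration of the label format described above (`kingdom;phylum;class;order;family;genus;species;common_name`), the sketch below parses one complete label and one that stops at genus. `parse_label` is a simplified stand-in written for this example, and both labels are illustrative rather than taken from the dataset.

```python
from typing import Dict

LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def parse_label(label: str) -> Dict[str, str]:
    """Split a ';'-separated label into taxonomic levels plus a common name.

    Missing or empty fields become 'unknown'.
    """
    parts = [p.strip() for p in label.split(";")]
    keys = LEVELS + ["common_name"]
    return {
        key: (parts[i] if i < len(parts) and parts[i] else "unknown")
        for i, key in enumerate(keys)
    }


# Illustrative labels: a full coyote label and one annotated only to genus
full = parse_label("animalia;chordata;mammalia;carnivora;canidae;canis;latrans;coyote")
partial = parse_label("animalia;chordata;mammalia;carnivora;canidae;canis")

print(full["genus"], full["species"], full["common_name"])
print(partial["genus"], partial["species"], partial["common_name"])
```

Filling absent levels with `'unknown'` keeps every parsed label at the same set of keys, which makes it straightforward to turn a list of parsed labels into DataFrame columns.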

In [None]:
class TaxonomicProcessor:
    """
    Process taxonomic labels and handle different taxonomic levels.
    """
    
    def __init__(self):
        self.taxonomic_levels = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
        self.common_name_level = 'common_name'
    
    def parse_taxonomic_label(self, label: str) -> Dict[str, str]:
        """
        Parse taxonomic label in format: kingdom;phylum;class;order;family;genus;species;common_name
        Returns a dictionary with taxonomic levels as keys.
        """
        parts = label.split(';')
        taxonomy = {}
        
        # Map parts to taxonomic levels
        for i, level in enumerate(self.taxonomic_levels):
            if i < len(parts):
                taxonomy[level] = parts[i].strip() if parts[i].strip() else 'unknown'
            else:
                taxonomy[level] = 'unknown'
        
        # Add common name if available
        if len(parts) > len(self.taxonomic_levels):
            taxonomy[self.common_name_level] = parts[len(self.taxonomic_levels)].strip()
        else:
            taxonomy[self.common_name_level] = 'unknown'
        
        return taxonomy
    
    def get_taxonomic_level(self, taxonomy: Dict[str, str], level: str) -> str:
        """
        Get the taxonomic label at a specific level.
        """
        return taxonomy.get(level, 'unknown')
    
    def rollup_to_level(self, taxonomy: Dict[str, str], target_level: str) -> str:
        """
        Roll up taxonomy to a higher level (e.g., from species to genus).
        """
        if target_level not in self.taxonomic_levels:
            return 'unknown'
        
        return taxonomy.get(target_level, 'unknown')

class SpeciesEvaluator:
    """
    Main evaluation class for species classification using SpeciesNet.
    """
    
    def __init__(self, config: Config):
        self.config = config
        self.taxonomic_processor = TaxonomicProcessor()
        self.ground_truth_df = None
        self.predictions = None
        self.evaluation_data = None
    
    def load_ground_truth(self):
        """
        Load ground truth from CSV file in iWildCam 2022 format.
        Expected columns: 'Nighttime Image', 'Ground Truth Label', 'Sequence ID'
        """
        if not os.path.exists(self.config.ground_truth_csv):
            raise FileNotFoundError(f"Ground truth file not found: {self.config.ground_truth_csv}")
        
        # Load CSV
        df = pd.read_csv(self.config.ground_truth_csv)
        
        # Check for required columns
        required_columns = ['Nighttime Image', 'Ground Truth Label']
        missing_columns = [col for col in required_columns if col not in df.columns]
        
        if missing_columns:
            print(f"Warning: Missing columns {missing_columns}")
            print(f"Available columns: {list(df.columns)}")
            
            # Try to infer column mapping
            image_col = None
            label_col = None
            
            for col in df.columns:
                if 'image' in col.lower() or 'filename' in col.lower():
                    image_col = col
                elif 'label' in col.lower() or 'ground' in col.lower():
                    label_col = col
            
            if image_col and label_col:
                print(f"Using '{image_col}' as image column and '{label_col}' as label column")
                df = df.rename(columns={image_col: 'Nighttime Image', label_col: 'Ground Truth Label'})
            else:
                raise ValueError("Could not identify image and label columns")
        
        # Parse taxonomic labels
        print("Parsing taxonomic labels...")
        taxonomies = []
        for _, row in df.iterrows():
            taxonomy = self.taxonomic_processor.parse_taxonomic_label(row['Ground Truth Label'])
            taxonomies.append(taxonomy)
        
        # Add taxonomic columns to dataframe
        taxonomy_df = pd.DataFrame(taxonomies)
        self.ground_truth_df = pd.concat([df, taxonomy_df], axis=1)
        
        print(f"Loaded {len(self.ground_truth_df)} ground truth labels")
        print(f"Parsed taxonomic information for {len(taxonomies)} labels")
        
        # Show sample of taxonomic parsing
        print("Sample taxonomic parsing:")
        for i in range(min(3, len(self.ground_truth_df))):
            row = self.ground_truth_df.iloc[i]
            print(f"Image: {row['Nighttime Image']}")
            print(f"Original: {row['Ground Truth Label']}")
            print(f"Species: {row['species']}, Genus: {row['genus']}, Family: {row['family']}")
            print("-" * 50)
        
        return self.ground_truth_df
    
    def load_predictions(self):
        """
        Load SpeciesNet predictions from JSON file.
        """
        if not os.path.exists(self.config.predictions_json):
            raise FileNotFoundError(f"Predictions file not found: {self.config.predictions_json}")
        
        with open(self.config.predictions_json, 'r') as f:
            data = json.load(f)
        
        self.predictions = data.get('predictions', [])
        print(f"Loaded {len(self.predictions)} SpeciesNet predictions")
        
        return self.predictions
    
    def align_data(self):
        """
        Align ground truth and predictions by filename.
        """
        if self.ground_truth_df is None:
            raise ValueError("Ground truth not loaded. Call load_ground_truth() first.")
        
        if self.predictions is None:
            raise ValueError("Predictions not loaded. Call load_predictions() first.")
        
        # Create prediction lookup by filename
        pred_lookup = {}
        for pred in self.predictions:
            filepath = pred.get('filepath', '')
            filename = os.path.basename(filepath)
            pred_lookup[filename] = pred
        
        # Align with ground truth
        aligned_data = []
        missing_predictions = []
        
        for _, row in self.ground_truth_df.iterrows():
            filename = row['Nighttime Image']
            
            if filename in pred_lookup:
                pred = pred_lookup[filename]
                
                # Extract prediction info
                prediction = pred.get('prediction', 'unknown')
                prediction_score = pred.get('prediction_score', 0.0)
                failures = pred.get('failures', [])
                
                aligned_data.append({
                    'filename': filename,
                    'ground_truth_full': row['Ground Truth Label'],
                    'gt_species': row['species'],
                    'gt_genus': row['genus'],
                    'gt_family': row['family'],
                    'gt_order': row['order'],
                    'gt_class': row['class'],
                    'gt_phylum': row['phylum'],
                    'gt_kingdom': row['kingdom'],
                    'gt_common_name': row.get('common_name', 'unknown'),
                    'prediction': prediction,
                    'prediction_score': prediction_score,
                    'failed': len(failures) > 0,
                    'failure_types': failures
                })
            else:
                missing_predictions.append(filename)
        
        self.evaluation_data = pd.DataFrame(aligned_data)
        
        print(f"Aligned {len(self.evaluation_data)} image predictions with ground truth")
        
        if missing_predictions:
            print(f"Missing predictions for {len(missing_predictions)} images")
            if len(missing_predictions) <= 10:
                print("Missing predictions for:", missing_predictions)
            else:
                print(f"First 10 missing:", missing_predictions[:10])
        
        return self.evaluation_data

# Initialize the evaluator
evaluator = SpeciesEvaluator(config)
print("Species evaluator initialized")

## Data Loading and Validation

This section executes the data loading pipeline and performs comprehensive validation of the ground truth and prediction data. The process ensures data integrity and alignment before proceeding with evaluation metrics computation.

**Data Loading Steps:**
1. **Ground Truth Loading**: Reads and parses taxonomic annotations from CSV format
2. **Prediction Loading**: Imports SpeciesNet results from JSON output files
3. **Data Alignment**: Matches predictions with ground truth by filename
4. **Validation Checks**: Identifies missing data, failed predictions, and format inconsistencies

**Quality Assurance:**
The framework performs extensive validation to ensure reliable evaluation results, including verification of taxonomic label parsing, prediction format consistency, and data completeness assessment.
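
The completeness check can be pictured as plain set arithmetic on filenames. The following sketch uses invented filenames and paths; it is an illustration of the idea rather than the notebook's own alignment code.

```python
import os

# Made-up filenames standing in for the CSV image column and for the
# `filepath` entries of the predictions file.
ground_truth_files = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
prediction_paths = [
    "/data/run1/img_001.jpg",
    "/data/run1/img_003.jpg",
    "/data/run1/img_999.jpg",
]

gt_names = set(ground_truth_files)
pred_names = {os.path.basename(p) for p in prediction_paths}

matched = sorted(gt_names & pred_names)
missing_predictions = sorted(gt_names - pred_names)     # labelled, never predicted
unlabelled_predictions = sorted(pred_names - gt_names)  # predicted, never labelled

# Basenames can collide when images live in different subfolders; a
# collision would silently overwrite entries in a filename-keyed lookup.
has_collisions = len(pred_names) != len(prediction_paths)

print("matched:", matched)
print("missing predictions:", missing_predictions)
print("predictions without ground truth:", unlabelled_predictions)
print("basename collisions:", has_collisions)
```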

In [None]:
# Load ground truth data
print("Loading ground truth data...")
try:
    ground_truth_df = evaluator.load_ground_truth()
    print("Ground truth loaded successfully")
except FileNotFoundError as e:
    print(f"Error: {e}")
    print("Please update the ground_truth_csv path in the configuration section above.")
except Exception as e:
    print(f"Error loading ground truth: {e}")

# Load SpeciesNet predictions
print("Loading SpeciesNet predictions...")
try:
    predictions = evaluator.load_predictions()
    print("Predictions loaded successfully")
except FileNotFoundError as e:
    print(f"Error: {e}")
    print("Please run the SpeciesNet command first, then update the predictions_json path.")
except Exception as e:
    print(f"Error loading predictions: {e}")

# Align ground truth and predictions
print("Aligning ground truth with predictions...")
try:
    evaluation_data = evaluator.align_data()
    print("Data alignment completed")
    
    # Show basic statistics
    print(f"Dataset Statistics:")
    print(f"- Total aligned samples: {len(evaluation_data)}")
    print(f"- Successful predictions: {len(evaluation_data[~evaluation_data['failed']])}")
    print(f"- Failed predictions: {len(evaluation_data[evaluation_data['failed']])}")
    
except Exception as e:
    print(f"Error aligning data: {e}")

# Display sample of aligned data
if 'evaluation_data' in locals():
    print("Sample of aligned data:")
    print(evaluation_data[['filename', 'gt_species', 'prediction', 'prediction_score']].head())

## Comprehensive Evaluation Metrics

This section implements the evaluation metrics for wildlife species classification. The metrics account for the hierarchical nature of taxonomic classification and the particular challenges of camera trap image analysis.

### Primary Evaluation Metrics

**Species-Level Classification Metrics:**
- **Accuracy**: Overall fraction of correctly identified species
- **Precision (Weighted)**: Average precision across species, weighted by species frequency in the dataset
- **Recall (Weighted)**: Average recall across species, weighted by species frequency
- **F1-Score (Weighted)**: Harmonic mean of precision and recall, accounting for class imbalance
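
To make "weighted" concrete, the sketch below computes the support-weighted averages by hand on a toy example. The labels are invented, and the calculation is meant to mirror what `average='weighted'` does in scikit-learn rather than to replace it.

```python
from collections import Counter

# Toy labels, invented for illustration: "deer" is common, "fox" is rare
y_true = ["deer", "deer", "deer", "deer", "fox", "fox"]
y_pred = ["deer", "deer", "deer", "deer", "fox", "deer"]

support = Counter(y_true)    # number of true examples per class
predicted = Counter(y_pred)  # number of predictions per class
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives


def per_class(label):
    """Return (precision, recall, f1) for one class, using 0 when undefined."""
    precision = hits[label] / predicted[label] if predicted[label] else 0.0
    recall = hits[label] / support[label] if support[label] else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


total = len(y_true)
accuracy = sum(hits.values()) / total

# Weighted average: each class contributes in proportion to its support
w_precision, w_recall, w_f1 = (
    sum(support[c] * per_class(c)[i] for c in support) / total
    for i in range(3)
)

print(f"accuracy={accuracy:.4f}")
print(f"weighted precision={w_precision:.4f} recall={w_recall:.4f} f1={w_f1:.4f}")
```

Note that support-weighted recall always equals overall accuracy, so the two numbers reported for the species level carry the same information.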

**Hierarchical Taxonomic Accuracy:**
- **Multi-level Accuracy**: Evaluates correctness at different taxonomic levels (Kingdom, Phylum, Class, Order, Family, Genus, Species)
- **Taxonomic Distance**: Measures how "close" incorrect predictions are to the true species in taxonomic hierarchy
- **Hierarchical Precision**: Considers predictions correct if they match at any taxonomic level above species
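
Multi-level accuracy can be sketched as follows, assuming both the ground truth and the prediction have already been converted to `{level: name}` dictionaries. How a raw SpeciesNet prediction string maps onto those levels is not covered here. The helper names and the three example pairs are made up for illustration.

```python
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def taxon(*names):
    """Build a {level: name} dict from names ordered kingdom -> species."""
    return dict(zip(LEVELS, names))


def level_matches(truth, pred):
    """Per-level correctness; a level counts only if every level above it matches."""
    result, still_matching = {}, True
    for level in LEVELS:
        still_matching = (
            still_matching
            and truth.get(level, "unknown") != "unknown"
            and truth[level] == pred.get(level)
        )
        result[level] = still_matching
    return result


gray_wolf = taxon("animalia", "chordata", "mammalia", "carnivora", "canidae", "canis", "lupus")
red_wolf = taxon("animalia", "chordata", "mammalia", "carnivora", "canidae", "canis", "rufus")
house_cat = taxon("animalia", "chordata", "mammalia", "carnivora", "felidae", "felis", "catus")

# (ground truth, prediction) pairs, invented for illustration
pairs = [(red_wolf, gray_wolf), (house_cat, gray_wolf), (gray_wolf, gray_wolf)]
matches = [level_matches(truth, pred) for truth, pred in pairs]

accuracy_by_level = {
    level: sum(m[level] for m in matches) / len(matches) for level in LEVELS
}
for level, acc in accuracy_by_level.items():
    print(f"{level:8s} {acc:.2f}")
```

Requiring all higher levels to match keeps the per-level accuracies monotone: accuracy can only stay the same or drop when moving from kingdom toward species.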

**Confidence and Reliability Metrics:**
- **Prediction Confidence Analysis**: Distribution of model confidence scores and their calibration
- **Confidence-Accuracy Correlation**: Relationship between prediction confidence and actual accuracy
- **Failure Mode Analysis**: Categorization and frequency of different types of prediction failures
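
One simple way to look at the confidence-accuracy relationship is to bin predictions by score and compare the accuracy within each bin. The sketch below is one possible approach; the function name, the bin edges, and the sample scores are assumptions chosen for this example.

```python
def accuracy_by_confidence(scores, correct, edges=(0.0, 0.5, 0.8, 1.0)):
    """Group predictions into confidence bins and report accuracy per bin.

    Bins are [lo, hi) except the last one, which also includes hi.
    Returns a list of (lo, hi, count, accuracy_or_None) tuples.
    """
    rows = []
    n_bins = len(edges) - 1
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        is_last = i == n_bins - 1
        in_bin = [
            c for s, c in zip(scores, correct)
            if lo <= s < hi or (is_last and s == hi)
        ]
        acc = sum(in_bin) / len(in_bin) if in_bin else None
        rows.append((lo, hi, len(in_bin), acc))
    return rows


# Invented scores and correctness flags for eight predictions
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.30, 0.20, 1.00]
correct = [True, True, False, True, False, False, False, True]

rows = accuracy_by_confidence(scores, correct)
for lo, hi, n, acc in rows:
    shown = "n/a" if acc is None else f"{acc:.2f}"
    print(f"[{lo:.1f}, {hi:.1f}]  n={n}  accuracy={shown}")
```

For a well-calibrated model the accuracy in each bin should sit close to the scores in that bin; large gaps in the high-confidence bins indicate overconfidence.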

### Advanced Evaluation Features

**Class Imbalance Handling:**
Wildlife datasets typically exhibit severe class imbalance, with some species having thousands of examples while others have only a few. The evaluation framework uses weighted metrics to account for this imbalance and provides per-class performance analysis.

**Taxonomic Hierarchy Awareness:**
Unlike generic image classification, species identification benefits from taxonomic knowledge. Predicting "Gray Wolf" when the true species is "Red Wolf" is a more acceptable error than predicting "House Cat": the two wolves share a genus, so the mistake is both taxonomically closer and ecologically more plausible.
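
The wolf and cat example can be turned into a number. The sketch below uses one possible definition of taxonomic distance, chosen for illustration: the number of ranks, counted up from species, at which two taxa differ. It is not necessarily the definition used elsewhere in this framework.

```python
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def taxonomic_distance(a, b):
    """Count the ranks below the deepest rank that both taxa share."""
    shared = 0
    for level in LEVELS:
        if a[level] != b[level]:
            break
        shared += 1
    return len(LEVELS) - shared


def taxon(*names):
    """Build a {level: name} dict from names ordered kingdom -> species."""
    return dict(zip(LEVELS, names))


red_wolf = taxon("animalia", "chordata", "mammalia", "carnivora", "canidae", "canis", "rufus")
gray_wolf = taxon("animalia", "chordata", "mammalia", "carnivora", "canidae", "canis", "lupus")
house_cat = taxon("animalia", "chordata", "mammalia", "carnivora", "felidae", "felis", "catus")

print(taxonomic_distance(red_wolf, gray_wolf))  # 1: same genus, different species
print(taxonomic_distance(red_wolf, house_cat))  # 3: same order, different family
```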

**Geographic and Temporal Context:**
The framework can incorporate geographic priors and temporal information when available, providing more nuanced evaluation that reflects real-world deployment scenarios where species distributions and seasonal patterns matter.

**Error Pattern Analysis:**
Systematic analysis of common misclassification patterns helps identify whether errors are due to visual similarity, taxonomic confusion, or systematic biases in the model or dataset.
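
A lightweight starting point for error pattern analysis is to count the most frequent (true, predicted) pairs among the misclassified images. The labels below are invented for illustration.

```python
from collections import Counter

# Invented ground truth and predictions for six images
y_true = ["coyote", "coyote", "coyote", "red fox", "red fox", "bobcat"]
y_pred = ["coyote", "red fox", "red fox", "coyote", "red fox", "bobcat"]

# Keep only the misclassified pairs and count how often each one occurs
confusions = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

for (true_label, predicted_label), count in confusions.most_common(3):
    print(f"{true_label} -> {predicted_label}: {count}")
```

Because the pairs are ordered, "coyote predicted as red fox" and "red fox predicted as coyote" are counted separately, which helps reveal whether a confusion is one-directional.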

In [None]:
def compute_taxonomic_metrics(evaluation_data: pd.DataFrame, taxonomic_level: str = 'species'):
    """
    Compute classification metrics at a specific taxonomic level.
    """
    # Filter out failed predictions
    successful_data = evaluation_data[~evaluation_data['failed']].copy()
    
    if len(successful_data) == 0:
        print(f"No successful predictions found for {taxonomic_level} level evaluation")
        return None
    
    # Get ground truth and predictions at the specified level
    gt_column = f'gt_{taxonomic_level}'
    
    if gt_column not in successful_data.columns:
        print(f"Ground truth column {gt_column} not found")
        return None
    
    y_true = successful_data[gt_column].values
    y_pred = successful_data['prediction'].values
    
    # SpeciesNet predictions can be at any taxonomic level and might be common
    # names or scientific names. They are used as-is for now; taxonomic
    # matching/mapping could be added here.
    y_pred = list(y_pred)
    
    # Get unique classes
    all_classes = sorted(list(set(y_true) | set(y_pred)))
    
    # Compute metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, average='weighted', zero_division=0
    )
    
    # Compute per-class metrics
    per_class_metrics = classification_report(
        y_true, y_pred, 
        labels=all_classes,
        target_names=all_classes,
        output_dict=True,
        zero_division=0
    )
    
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred, labels=all_classes)
    
    results = {
        'taxonomic_level': taxonomic_level,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'num_samples': len(successful_data),
        'num_classes': len(all_classes),
        'per_class_metrics': per_class_metrics,
        'confusion_matrix': cm,
        'class_labels': all_classes,
        'y_true': y_true,
        'y_pred': y_pred
    }
    
    return results

def compute_hierarchical_accuracy(evaluation_data: pd.DataFrame):
    """
    Compute hierarchical accuracy where a prediction is considered correct
    if it matches at any taxonomic level (e.g., correct genus even if species is wrong).
    """
    successful_data = evaluation_data[~evaluation_data['failed']].copy()
    
    if len(successful_data) == 0:
        return None
    
    taxonomic_levels = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']
    hierarchical_matches = []
    
    for _, row in successful_data.iterrows():
        prediction = str(row['prediction']).lower().strip()
        # Split the prediction into its ';'-separated fields so that matching is
        # done on whole names rather than substrings (e.g. "sus" vs "ursus")
        prediction_fields = {field.strip() for field in prediction.split(';')}
        matches_at_level = {}
        
        # Check if prediction matches at any taxonomic level
        for level in taxonomic_levels:
            gt_value = str(row[f'gt_{level}']).lower().strip()
            
            # A level matches when one of the prediction fields equals the
            # ground truth value; unknown ground truth never counts as a match
            matches_at_level[level] = (
                gt_value != 'unknown' and gt_value in prediction_fields
            )
        
        hierarchical_matches.append(matches_at_level)
    
    # Calculate accuracy at each level
    hierarchical_accuracy = {}
    for level in taxonomic_levels:
        correct_at_level = sum([match[level] for match in hierarchical_matches])
        hierarchical_accuracy[level] = correct_at_level / len(hierarchical_matches)
    
    return hierarchical_accuracy

# Compute metrics if data is available
if 'evaluation_data' in locals() and len(evaluation_data) > 0:
    print("Computing evaluation metrics...")
    
    # Compute species-level metrics (primary evaluation)
    species_metrics = compute_taxonomic_metrics(evaluation_data, 'species')
    
    if species_metrics:
        print(f"{'='*60}")
        print(f"SPECIES-LEVEL CLASSIFICATION METRICS")
        print(f"{'='*60}")
        print(f"Accuracy: {species_metrics['accuracy']:.4f}")
        print(f"Precision (weighted): {species_metrics['precision']:.4f}")
        print(f"Recall (weighted): {species_metrics['recall']:.4f}")
        print(f"F1-Score (weighted): {species_metrics['f1_score']:.4f}")
        print(f"Number of samples: {species_metrics['num_samples']}")
        print(f"Number of classes: {species_metrics['num_classes']}")
    
    # Compute hierarchical accuracy
    print(f"{'='*60}")
    print(f"HIERARCHICAL ACCURACY")
    print(f"{'='*60}")
    
    hierarchical_acc = compute_hierarchical_accuracy(evaluation_data)
    if hierarchical_acc:
        for level, acc in hierarchical_acc.items():
            print(f"{level.capitalize()}: {acc:.4f}")
    
    # Compute metrics at other taxonomic levels
    other_levels = ['genus', 'family', 'order']
    other_metrics = {}
    
    for level in other_levels:
        metrics = compute_taxonomic_metrics(evaluation_data, level)
        if metrics:
            other_metrics[level] = metrics
            print(f"{level.upper()}-LEVEL ACCURACY: {metrics['accuracy']:.4f}")
    
else:
    print("Evaluation data not available. Please load and align data first.")

## Visualization and Performance Analysis

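This section provides plotting functions for examining SpeciesNet's performance patterns and locating its errors at different taxonomic levels. The lists below outline the analyses this kind of evaluation supports; the code cell that follows implements four of them: prediction and ground-truth distributions, hierarchical accuracy, a confusion matrix over the most common classes, and confidence distributions with a binned confidence-versus-accuracy plot.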

**Visualization Components:**

**Distribution Analysis:**
- **Species Distribution Plots**: Compare the frequency distribution of predicted vs ground truth species
- **Taxonomic Level Comparisons**: Side-by-side analysis of predictions and ground truth at different taxonomic levels
- **Geographic Distribution**: Spatial analysis of prediction accuracy across different regions (when geographic data is available)

**Performance Visualizations:**
- **Hierarchical Accuracy Charts**: Bar charts showing accuracy at each taxonomic level
- **Confusion Matrices**: Heat maps showing classification patterns, with options for different taxonomic levels
- **Confidence Distribution Plots**: Histograms and box plots of prediction confidence scores

**Error Analysis Visualizations:**
- **Error Pattern Heat Maps**: Visualization of common species confusion patterns
- **Confidence vs Accuracy Scatter Plots**: Analysis of model calibration and overconfidence
- **Failure Mode Distribution**: Categorical analysis of why predictions fail

**Interactive Sample Analysis:**
- **Sample Image Grids**: Display representative images with predictions and ground truth
- **Error Case Studies**: Detailed examination of common misclassification patterns
- **Confidence-based Sampling**: Show high/low confidence predictions for manual inspection

These visualizations are essential for understanding model behavior, identifying systematic biases, and making informed decisions about model deployment and improvement strategies in wildlife monitoring applications.
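
The confidence-versus-accuracy comparison described above can also be condensed into a single number. The sketch below computes an expected calibration error (ECE) over equal-width confidence bins. The function name `expected_calibration_error` and its array inputs are illustrative and not part of the notebook, and it assumes correctness has already been decided per image (for example by exact species-name match).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted mean gap between accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if confidences.size == 0:
        return float("nan")

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields indices 0..n_bins-1,
    # so a score of exactly 1.0 falls into the last bin
    bin_ids = np.digitize(confidences, edges[1:-1])

    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the share of samples in the bin
    return float(ece)

# Toy values for illustration only
print(expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0]))
```

A value near zero means confidence scores track observed accuracy closely; larger values suggest over- or under-confidence. With this notebook's data the inputs would be the `prediction_score` column and a per-row correctness flag.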

In [None]:
def plot_prediction_distribution(evaluation_data: pd.DataFrame, top_n: int = 20):
    """
    Plot distribution of predictions and ground truth labels.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
    
    # Successful predictions only
    successful_data = evaluation_data[~evaluation_data['failed']]
    
    # Ground truth distribution
    gt_counts = successful_data['gt_species'].value_counts().head(top_n)
    ax1.barh(range(len(gt_counts)), gt_counts.values)
    ax1.set_yticks(range(len(gt_counts)))
    ax1.set_yticklabels(gt_counts.index, fontsize=10)
    ax1.set_xlabel('Count')
    ax1.set_title(f'Top {top_n} Ground Truth Species')
    ax1.invert_yaxis()
    
    # Prediction distribution
    pred_counts = successful_data['prediction'].value_counts().head(top_n)
    ax2.barh(range(len(pred_counts)), pred_counts.values, color='orange')
    ax2.set_yticks(range(len(pred_counts)))
    ax2.set_yticklabels(pred_counts.index, fontsize=10)
    ax2.set_xlabel('Count')
    ax2.set_title(f'Top {top_n} SpeciesNet Predictions')
    ax2.invert_yaxis()
    
    plt.tight_layout()
    plt.show()

def plot_hierarchical_accuracy(hierarchical_acc: Dict[str, float]):
    """
    Plot hierarchical accuracy across taxonomic levels.
    """
    if not hierarchical_acc:
        print("No hierarchical accuracy data available")
        return
    
    levels = list(hierarchical_acc.keys())
    accuracies = list(hierarchical_acc.values())
    
    plt.figure(figsize=(12, 6))
    bars = plt.bar(levels, accuracies, color='skyblue', alpha=0.8, edgecolor='navy')
    plt.xlabel('Taxonomic Level')
    plt.ylabel('Accuracy')
    plt.title('Hierarchical Accuracy Across Taxonomic Levels')
    plt.ylim(0, 1)
    
    # Add value labels on bars
    for bar, acc in zip(bars, accuracies):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')
    
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()

def plot_confusion_matrix_subset(cm, class_labels, top_n: int = 15):
    """
    Plot confusion matrix for top N most common classes.
    """
    if len(class_labels) <= top_n:
        # Use all classes if we have fewer than top_n
        selected_indices = list(range(len(class_labels)))
        selected_labels = class_labels
        selected_cm = cm
    else:
        # Select top N classes by support (sum of rows)
        class_support = cm.sum(axis=1)
        top_indices = np.argsort(class_support)[-top_n:][::-1]
        
        selected_indices = top_indices
        selected_labels = [class_labels[i] for i in top_indices]
        selected_cm = cm[np.ix_(top_indices, top_indices)]
    
    plt.figure(figsize=(12, 10))
    
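    # Row-normalize so each cell is the fraction of that true class;
    # rows with no samples stay at zero instead of triggering divide-by-zero warnings
    row_sums = selected_cm.sum(axis=1, keepdims=True)
    cm_normalized = np.divide(selected_cm, row_sums,
                              out=np.zeros(selected_cm.shape, dtype=float),
                              where=row_sums != 0)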
    
    sns.heatmap(cm_normalized, 
                xticklabels=selected_labels, 
                yticklabels=selected_labels,
                annot=True, 
                fmt='.2f', 
                cmap='Blues',
                cbar_kws={'label': 'Normalized Count'})
    
    plt.xlabel('Predicted Species')
    plt.ylabel('True Species')
    plt.title(f'Confusion Matrix - Top {len(selected_labels)} Most Common Species')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

def plot_confidence_distribution(evaluation_data: pd.DataFrame):
    """
    Plot distribution of prediction confidence scores.
    """
    successful_data = evaluation_data[~evaluation_data['failed']]
    
    if len(successful_data) == 0:
        print("No successful predictions to analyze")
        return
    
    plt.figure(figsize=(12, 5))
    
    # Overall confidence distribution
    plt.subplot(1, 2, 1)
    plt.hist(successful_data['prediction_score'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
    plt.xlabel('Prediction Score')
    plt.ylabel('Frequency')
    plt.title('Distribution of SpeciesNet Confidence Scores')
    plt.grid(alpha=0.3)
    
    # Confidence vs accuracy
    plt.subplot(1, 2, 2)
    
    # Create confidence bins
    bins = np.linspace(0, 1, 11)
    bin_centers = (bins[:-1] + bins[1:]) / 2
    bin_accuracies = []
    bin_counts = []
    
    for i in range(len(bins)-1):
        mask = (successful_data['prediction_score'] >= bins[i]) & (successful_data['prediction_score'] < bins[i+1])
        if i == len(bins)-2:  # Include the last bin's right edge
            mask = (successful_data['prediction_score'] >= bins[i]) & (successful_data['prediction_score'] <= bins[i+1])
        
        bin_data = successful_data[mask]
        if len(bin_data) > 0:
            # Simple exact-name check, normalized like the other comparisons in this
            # notebook (this could be enhanced with proper taxonomic matching)
            correct = (
                bin_data['prediction'].str.lower().str.strip()
                == bin_data['gt_species'].str.lower().str.strip()
            ).sum()
            accuracy = correct / len(bin_data)
            bin_accuracies.append(accuracy)
            bin_counts.append(len(bin_data))
        else:
            bin_accuracies.append(0)
            bin_counts.append(0)
    
    # Plot confidence vs accuracy
    plt.scatter(bin_centers, bin_accuracies, s=[c*3 for c in bin_counts], alpha=0.7, color='orange')
    plt.xlabel('Confidence Score')
    plt.ylabel('Accuracy')
    plt.title('Confidence vs Accuracy')
    plt.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Generate visualizations if data is available
if 'evaluation_data' in locals() and len(evaluation_data) > 0:
    print("Generating visualizations...")
    
    # 1. Prediction distribution
    plot_prediction_distribution(evaluation_data)
    
    # 2. Hierarchical accuracy
    if 'hierarchical_acc' in locals():
        plot_hierarchical_accuracy(hierarchical_acc)
    
    # 3. Confusion matrix
    if 'species_metrics' in locals() and species_metrics:
        plot_confusion_matrix_subset(species_metrics['confusion_matrix'], 
                                   species_metrics['class_labels'])
    
    # 4. Confidence distribution
    plot_confidence_distribution(evaluation_data)
    
else:
    print("No evaluation data available for visualization.")

## Sample Image Analysis and Error Investigation

This section supports direct visual inspection of camera trap images alongside their predictions. Inspecting images matters for wildlife classification because it reveals patterns that numerical metrics alone can miss.

**Sample Analysis Features:**

**Stratified Sampling:**
- **Correct Classifications**: Examples where SpeciesNet correctly identified the species
- **Incorrect Classifications**: Cases where the prediction was wrong, with analysis of error types
- **High Confidence Samples**: Images where the model was very confident (confidence > 0.8)
- **Low Confidence Samples**: Uncertain predictions (confidence < 0.3) that might require human review
- **Mixed Sampling**: A random draw from all successful predictions for a general overview

**Error Pattern Investigation:**
- **Taxonomic Error Analysis**: Examination of whether errors occur within the same taxonomic family/genus
- **Visual Similarity Confusions**: Cases where species are visually similar but taxonomically different
- **Environmental Factor Analysis**: How lighting, weather, or habitat affects classification accuracy
- **Pose and Occlusion Effects**: Impact of animal positioning and partial visibility on predictions

**Quality Assessment:**
Visual inspection helps validate the evaluation metrics by providing context for numerical results. It reveals whether model errors are "reasonable" (confusing similar-looking species) or indicate systematic issues (consistently misidentifying distinctive species).
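
One way to make "reasonable versus systematic" concrete is to record the most specific taxonomic rank at which a wrong prediction still agrees with the ground truth. The sketch below assumes each label can be expanded into a lineage dictionary keyed by rank. The notebook only stores ground-truth ranks in `gt_*` columns, so the predicted lineages and the helper `deepest_shared_rank` are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional

# Ranks ordered from most to least specific
RANKS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']

def _norm(value) -> str:
    """Lowercase and strip a label; treat None and 'nan' as missing."""
    text = '' if value is None else str(value).lower().strip()
    return '' if text == 'nan' else text

def deepest_shared_rank(gt: Dict[str, str], pred: Dict[str, str]) -> Optional[str]:
    """Return the most specific rank where both lineages agree, or None."""
    for rank in RANKS:
        g = _norm(gt.get(rank))
        p = _norm(pred.get(rank))
        if g and g == p:
            return rank
    return None

# Hypothetical lineages for illustration only
gt = {'species': 'vulpes vulpes', 'genus': 'vulpes',
      'family': 'canidae', 'order': 'carnivora'}
near_miss = {'species': 'vulpes lagopus', 'genus': 'vulpes',
             'family': 'canidae', 'order': 'carnivora'}
far_miss = {'species': 'felis catus', 'genus': 'felis',
            'family': 'felidae', 'order': 'carnivora'}

summary = Counter(deepest_shared_rank(gt, p) for p in (near_miss, far_miss))
print(summary)
```

Errors that still agree at genus or family level are the "reasonable" kind; errors that only agree at order or above, or not at all, are better candidates for manual image review.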

In [None]:
def display_sample_images(evaluation_data: pd.DataFrame, images_folder: str, 
                         sample_type: str = 'mixed', n_samples: int = 6):
    """
    Display sample images with their predictions and ground truth.
    
    sample_type options:
    - 'correct': Show correctly classified images
    - 'incorrect': Show incorrectly classified images  
    - 'high_confidence': Show high confidence predictions
    - 'low_confidence': Show low confidence predictions
    - 'mixed': Show a mix of different cases
    """
    successful_data = evaluation_data[~evaluation_data['failed']]
    
    if len(successful_data) == 0:
        print("No successful predictions to display")
        return
    
    # Create simple accuracy check (could be enhanced with taxonomic matching)
    successful_data = successful_data.copy()
    successful_data['simple_correct'] = (
        successful_data['prediction'].str.lower().str.strip() == 
        successful_data['gt_species'].str.lower().str.strip()
    )
    
    # Select samples based on type
    if sample_type == 'correct':
        sample_data = successful_data[successful_data['simple_correct']]
        title_suffix = "Correctly Classified"
    elif sample_type == 'incorrect':
        sample_data = successful_data[~successful_data['simple_correct']]
        title_suffix = "Incorrectly Classified"
    elif sample_type == 'high_confidence':
        sample_data = successful_data[successful_data['prediction_score'] > 0.8]
        title_suffix = "High Confidence (>0.8)"
    elif sample_type == 'low_confidence':
        sample_data = successful_data[successful_data['prediction_score'] < 0.3]
        title_suffix = "Low Confidence (<0.3)"
    else:  # mixed
        sample_data = successful_data
        title_suffix = "Mixed Sample"
    
    if len(sample_data) == 0:
        print(f"No samples found for type: {sample_type}")
        return
    
    # Sample randomly
    sample_indices = np.random.choice(len(sample_data), 
                                    size=min(n_samples, len(sample_data)), 
                                    replace=False)
    samples = sample_data.iloc[sample_indices]
    
    # Create subplot grid
    n_cols = 3
    n_rows = (len(samples) + n_cols - 1) // n_cols
    
    # squeeze=False always returns a 2D axes array, even for a single row
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows), squeeze=False)
    
    fig.suptitle(f'Sample Images - {title_suffix}', fontsize=16, fontweight='bold')
    
    for idx, (_, row) in enumerate(samples.iterrows()):
        ax = axes[idx // n_cols, idx % n_cols]
        
        # Try to load and display image
        image_path = os.path.join(images_folder, row['filename'])
        
        try:
            if os.path.exists(image_path):
                img = Image.open(image_path)
                ax.imshow(img)
            else:
                # Create placeholder if image not found
                ax.text(0.5, 0.5, f"Image not found:\n{row['filename']}", 
                       ha='center', va='center', transform=ax.transAxes, fontsize=10)
        except Exception as e:
            ax.text(0.5, 0.5, f"Error loading image:\n{e}", 
                   ha='center', va='center', transform=ax.transAxes, fontsize=10)
        
        # Add title with prediction info
        gt_species = row['gt_species']
        prediction = row['prediction']
        confidence = row['prediction_score']
        
        title = f"GT: {gt_species}\nPred: {prediction}\nConf: {confidence:.3f}"
        
        # Color code the title
        if row['simple_correct']:
            title_color = 'green'
        else:
            title_color = 'red'
        
        ax.set_title(title, fontsize=10, color=title_color, fontweight='bold')
        ax.axis('off')
    
    # Hide empty subplots
    for idx in range(len(samples), n_rows * n_cols):
        axes[idx // n_cols, idx % n_cols].axis('off')
    
    plt.tight_layout()
    plt.show()

def analyze_prediction_errors(evaluation_data: pd.DataFrame, top_n: int = 10):
    """
    Analyze common prediction errors.
    """
    successful_data = evaluation_data[~evaluation_data['failed']].copy()
    
    if len(successful_data) == 0:
        print("No successful predictions to analyze")
        return
    
    # Simple accuracy check
    successful_data['simple_correct'] = (
        successful_data['prediction'].str.lower().str.strip() == 
        successful_data['gt_species'].str.lower().str.strip()
    )
    
    incorrect_data = successful_data[~successful_data['simple_correct']]
    
    if len(incorrect_data) == 0:
        print("No incorrect predictions found!")
        return
    
    print(f"{'='*60}")
    print(f"PREDICTION ERROR ANALYSIS")
    print(f"{'='*60}")
    print(f"Total samples: {len(successful_data)}")
    print(f"Correct predictions: {len(successful_data[successful_data['simple_correct']])}")
    print(f"Incorrect predictions: {len(incorrect_data)}")
    print(f"Simple accuracy: {len(successful_data[successful_data['simple_correct']]) / len(successful_data):.4f}")
    
    # Most common error patterns
    print(f"Most common error patterns:")
    error_patterns = incorrect_data.groupby(['gt_species', 'prediction']).size().sort_values(ascending=False)
    
    for i, ((gt, pred), count) in enumerate(error_patterns.head(top_n).items()):
        print(f"{i+1:2d}. {gt} → {pred} ({count} times)")
    
    # Most commonly confused ground truth species
    print(f"Most commonly misclassified ground truth species:")
    confused_gt = incorrect_data['gt_species'].value_counts().head(top_n)
    
    for i, (species, count) in enumerate(confused_gt.items()):
        total_gt_count = successful_data[successful_data['gt_species'] == species].shape[0]
        error_rate = count / total_gt_count if total_gt_count > 0 else 0
        print(f"{i+1:2d}. {species}: {count}/{total_gt_count} errors (error rate: {error_rate:.3f})")
    
    # Most common incorrect predictions
    print(f"Most common incorrect predictions:")
    wrong_preds = incorrect_data['prediction'].value_counts().head(top_n)
    
    for i, (prediction, count) in enumerate(wrong_preds.items()):
        print(f"{i+1:2d}. {prediction}: {count} times")

# Display sample images if data is available
if 'evaluation_data' in locals() and len(evaluation_data) > 0:
    print("Displaying sample images...")
    
    # Show mixed samples first
    display_sample_images(evaluation_data, config.images_folder, 'mixed', 6)
    
    # Analyze prediction errors
    analyze_prediction_errors(evaluation_data)

    print("="*60)
    print("Additional sample types you can explore:")
    print("- display_sample_images(evaluation_data, config.images_folder, 'correct', 6)")
    print("- display_sample_images(evaluation_data, config.images_folder, 'incorrect', 6)")
    print("- display_sample_images(evaluation_data, config.images_folder, 'high_confidence', 6)")
    print("- display_sample_images(evaluation_data, config.images_folder, 'low_confidence', 6)")
    
else:
    print("No evaluation data available for sample analysis.")

## Comprehensive Evaluation Summary and Export

This section prints an evaluation summary and exports the results to CSV and JSON files for further analysis and integration with other research tools. The summary is intended to inform model improvement and deployment decisions.

**Summary Report Components:**

**Dataset Overview:**
- Total number of images processed and success/failure rates
- Species diversity and distribution statistics  
- Quality assessment of ground truth annotations

**Performance Summary:**
- Overall accuracy metrics across all taxonomic levels
- Confidence score statistics and calibration analysis
- Identification of best and worst performing species categories

**Error Analysis:**
- Common misclassification patterns and their frequencies
- Systematic error identification (consistent confusion between specific species pairs)
- Failure mode categorization (detection failures vs classification errors)

**Export Formats:**
- **CSV Files**: Detailed per-image results for integration with other analysis tools
- **JSON Summaries**: Structured metrics data for programmatic access
- **Text Reports**: Human-readable evaluation summary printed to the notebook output
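
JSON export can fail when metric dictionaries hold NumPy scalars or arrays, which the standard `json` module does not serialize. Whether the notebook's metric dictionaries contain such values depends on how they were built earlier, so the following is only a sketch of a defensive approach. The helper name `to_jsonable` and the sample `metrics` dictionary are illustrative.

```python
import json
import numpy as np

def to_jsonable(obj):
    """Fallback for json.dump(s): convert NumPy values to plain Python types."""
    if isinstance(obj, np.generic):   # NumPy scalars such as int64 or float32
        return obj.item()
    if isinstance(obj, np.ndarray):   # arrays such as a confusion matrix
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Illustrative values only
metrics = {
    'num_samples': np.int64(3),
    'accuracy': np.float32(0.5),
    'confusion_matrix': np.array([[1, 0], [0, 2]]),
}
print(json.dumps(metrics, default=to_jsonable, indent=2))
```

The `default` hook is applied to values only, so dictionary keys still need to be plain strings or numbers before dumping.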

**Deployment Recommendations:**
Based on the evaluation results, the framework provides specific recommendations for:
- Model deployment strategies in different scenarios
- Confidence threshold optimization
- Species-specific performance considerations
- Areas requiring additional training data or model improvement
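
Confidence threshold optimization can be approached by sweeping a cutoff and recording how many predictions are kept (coverage) and how accurate the kept ones are. The sketch below assumes per-image confidence scores and a boolean correctness flag are already available. `sweep_thresholds` and the column names of its output are illustrative choices, not part of the notebook.

```python
import numpy as np
import pandas as pd

def sweep_thresholds(scores, correct, thresholds=None) -> pd.DataFrame:
    """Tabulate coverage and accuracy of predictions kept at each cutoff."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    if thresholds is None:
        thresholds = np.linspace(0.0, 0.9, 10)

    rows = []
    for t in thresholds:
        kept = scores >= t
        n_kept = int(kept.sum())
        rows.append({
            'threshold': float(t),
            'coverage': n_kept / len(scores) if len(scores) else 0.0,
            # Accuracy is undefined when nothing passes the cutoff
            'accuracy': float(correct[kept].mean()) if n_kept else float('nan'),
        })
    return pd.DataFrame(rows)

# Toy data for illustration only
table = sweep_thresholds(
    scores=[0.95, 0.85, 0.60, 0.40, 0.20],
    correct=[True, True, False, True, False],
    thresholds=[0.0, 0.5, 0.8],
)
print(table)
```

Raising the cutoff usually trades coverage for accuracy; the acceptable balance depends on how much human review a deployment can absorb for the images that fall below it.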

In [None]:
def generate_evaluation_summary(evaluation_data: pd.DataFrame, 
                              species_metrics: Optional[Dict] = None,
                              hierarchical_acc: Optional[Dict] = None,
                              other_metrics: Optional[Dict] = None):
    """
    Generate a comprehensive evaluation summary.
    """
    print("\n" + "="*80)
    print("SPECIESNET EVALUATION SUMMARY")
    print("="*80)
    
    # Dataset overview
    total_samples = len(evaluation_data)
    successful_samples = len(evaluation_data[~evaluation_data['failed']])
    failed_samples = len(evaluation_data[evaluation_data['failed']])
    
    print(f"Dataset Overview:")
    print(f"- Total samples: {total_samples}")
    print(f"- Successful predictions: {successful_samples} ({successful_samples/total_samples*100:.1f}%)")
    print(f"- Failed predictions: {failed_samples} ({failed_samples/total_samples*100:.1f}%)")
    
    # Failure analysis
    if failed_samples > 0:
        failure_types = Counter()
        for _, row in evaluation_data[evaluation_data['failed']].iterrows():
            for failure in row['failure_types']:
                failure_types[failure] += 1
        
        print(f"Failure Analysis:")
        for failure_type, count in failure_types.items():
            print(f"- {failure_type}: {count} ({count/failed_samples*100:.1f}% of failures)")
    
    # Species-level performance
    if species_metrics:
        print(f"Species-Level Performance:")
        print(f"- Accuracy: {species_metrics['accuracy']:.4f}")
        print(f"- Precision (weighted): {species_metrics['precision']:.4f}")
        print(f"- Recall (weighted): {species_metrics['recall']:.4f}")
        print(f"- F1-Score (weighted): {species_metrics['f1_score']:.4f}")
        print(f"- Number of unique species: {species_metrics['num_classes']}")
    
    # Hierarchical accuracy
    if hierarchical_acc:
        print(f"Hierarchical Accuracy:")
        for level, acc in hierarchical_acc.items():
            print(f"- {level.capitalize()}: {acc:.4f}")
    
    # Other taxonomic levels
    if other_metrics:
        print(f"Other Taxonomic Levels:")
        for level, metrics in other_metrics.items():
            print(f"- {level.capitalize()}: {metrics['accuracy']:.4f}")
    
    # Confidence statistics
    successful_data = evaluation_data[~evaluation_data['failed']]
    if len(successful_data) > 0:
        confidence_stats = successful_data['prediction_score'].describe()
        print(f"Confidence Score Statistics:")
        print(f"- Mean: {confidence_stats['mean']:.4f}")
        print(f"- Median: {confidence_stats['50%']:.4f}")
        print(f"- Std: {confidence_stats['std']:.4f}")
        print(f"- Min: {confidence_stats['min']:.4f}")
        print(f"- Max: {confidence_stats['max']:.4f}")
    
    # Top predictions
    if len(successful_data) > 0:
        top_predictions = successful_data['prediction'].value_counts().head(10)
        print(f"Top 10 Most Common Predictions:")
        for i, (pred, count) in enumerate(top_predictions.items()):
            print(f"{i+1:2d}. {pred}: {count} ({count/len(successful_data)*100:.1f}%)")

def export_results(evaluation_data: pd.DataFrame, 
                  species_metrics: Optional[Dict] = None,
                  output_dir: str = "."):
    """
    Export evaluation results to files.
    """
    output_path = Path(output_dir)
    # parents=True also creates missing intermediate directories
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Export detailed results
    results_file = output_path / "speciesnet_evaluation_results.csv"
    evaluation_data.to_csv(results_file, index=False)
    print(f"Detailed results exported to: {results_file}")
    
    # Export summary metrics
    if species_metrics:
        metrics_file = output_path / "speciesnet_metrics_summary.json"
        
        # Prepare metrics for JSON serialization
        exportable_metrics = {
            'taxonomic_level': species_metrics['taxonomic_level'],
            'accuracy': float(species_metrics['accuracy']),
            'precision': float(species_metrics['precision']),
            'recall': float(species_metrics['recall']),
            'f1_score': float(species_metrics['f1_score']),
            'num_samples': int(species_metrics['num_samples']),
            'num_classes': int(species_metrics['num_classes'])
        }
        
        with open(metrics_file, 'w') as f:
            json.dump(exportable_metrics, f, indent=2)
        
        print(f"Metrics summary exported to: {metrics_file}")
    
    # Export per-class metrics
    if species_metrics and 'per_class_metrics' in species_metrics:
        per_class_file = output_path / "speciesnet_per_class_metrics.json"
        
        with open(per_class_file, 'w') as f:
            json.dump(species_metrics['per_class_metrics'], f, indent=2)
        
        print(f"Per-class metrics exported to: {per_class_file}")

# Generate summary and export results if data is available
if 'evaluation_data' in locals() and len(evaluation_data) > 0:
    print("Generating evaluation summary...")
    
    # Generate comprehensive summary
    generate_evaluation_summary(
        evaluation_data,
        species_metrics if 'species_metrics' in locals() else None,
        hierarchical_acc if 'hierarchical_acc' in locals() else None,
        other_metrics if 'other_metrics' in locals() else None
    )
    
    # Export results
    try:
        export_results(
            evaluation_data,
            species_metrics if 'species_metrics' in locals() else None,
            output_dir="."
        )
        print("All results exported successfully!")
    except Exception as e:
        print(f"Error exporting results: {e}")
    
    print("\n" + "="*80)
    print("EVALUATION COMPLETE!")
    print("="*80)
    print("Next steps:")
    print("1. Review the exported CSV file for detailed per-image results")
    print("2. Use the JSON files for integration with other analysis tools")
    print("3. Consider fine-tuning or domain adaptation if accuracy is low")
    print("4. Explore taxonomic hierarchy-aware evaluation metrics")
    
else:
    print("No evaluation data available for summary generation.")

In [None]:
# Complete evaluation pipeline summary

print("\n" + "="*80)
print("SPECIESNET WILDLIFE CLASSIFICATION EVALUATION COMPLETE")
print("="*80)

if 'evaluation_data' in locals() and len(evaluation_data) > 0:
    print("\nEVALUATION SUMMARY:")
    successful_preds = len(evaluation_data[~evaluation_data['failed']])
    total_preds = len(evaluation_data)
    success_rate = successful_preds / total_preds * 100 if total_preds > 0 else 0
    
    print(f"  - Total images processed: {total_preds}")
    print(f"  - Successful predictions: {successful_preds} ({success_rate:.1f}%)")
    
    if 'species_metrics' in locals() and species_metrics:
        print(f"  - Species-level accuracy: {species_metrics['accuracy']:.4f}")
        print(f"  - Weighted F1-score: {species_metrics['f1_score']:.4f}")
        print(f"  - Number of species: {species_metrics['num_classes']}")
    
    if 'hierarchical_acc' in locals() and hierarchical_acc:
        print(f"  - Genus-level accuracy: {hierarchical_acc.get('genus', 0):.4f}")
        print(f"  - Family-level accuracy: {hierarchical_acc.get('family', 0):.4f}")
    
    print("\nOutput files (written when the export step succeeded):")
    print("  - speciesnet_evaluation_results.csv (detailed per-image results)")
    print("  - speciesnet_metrics_summary.json (overall performance metrics)")
    print("  - speciesnet_per_class_metrics.json (species-specific performance)")

print(f"\nThis completes the downstream evaluation for wildlife species classification.")
print("Use the generated metrics and visualizations to assess restoration model quality")  
print("and determine impact on ecological monitoring applications.")

print("\nNext steps for analysis:")
print("1. Review hierarchical accuracy to understand taxonomic-level performance")
print("2. Examine confusion matrices to identify commonly confused species pairs")
print("3. Analyze confidence distributions to set appropriate deployment thresholds")
print("4. Use sample image analysis to validate model behavior on challenging cases")