# Memory Optimization Guide for Binary Neural Networks with 224x224 Images

This notebook collects memory optimization techniques for training Binary Neural Networks (BNNs) on 224x224 images on systems with limited GPU memory.

## Quick Start Guide

1. Set your dataset path in the "Set Dataset Path" cell
2. Choose a memory optimization strategy:
   - **Strategy A**: Use the ConvBNN224x224 model for full 224x224 resolution
   - **Strategy B**: Use feature extraction with a pre-trained model
   - **Strategy C**: Use patch-based processing
   - **Strategy D**: Use layer-freezing techniques

3. Run the complete recipe in the final cell

## Memory Optimization Techniques in this Notebook

- **Aggressive Memory Settings**:
  - Low batch size with gradient accumulation
  - Garbage collection after each batch
  - Reduced worker threads
  - Disabled pin memory

- **Model Architecture Optimizations**:
  - Binary activations and weights
  - Progressive dimensionality reduction
  - Convolutional feature extraction
  - Gradient checkpointing
  - Depthwise separable convolutions

- **Training Process Optimizations**:
  - Mixed precision training
  - Sample limiting per class
  - Feature pre-extraction
  - Patch-based processing
  - Model compression techniques

Choose the techniques that work best for your specific hardware limitations while maintaining the desired 224x224 image resolution.
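
As one illustration of the "sample limiting per class" technique listed above, the sketch below keeps at most a fixed number of indices per label. The function name `limit_indices_per_class` is hypothetical and is not defined elsewhere in this notebook; it only assumes a flat list of integer labels.

```python
from collections import defaultdict

def limit_indices_per_class(targets, max_per_class):
    """Return dataset indices keeping at most max_per_class samples per label."""
    kept = []
    seen = defaultdict(int)  # label -> number of samples kept so far
    for idx, label in enumerate(targets):
        if seen[label] < max_per_class:
            seen[label] += 1
            kept.append(idx)
    return kept

# Toy labels: class 0 appears 4 times, class 1 appears 2 times
toy_targets = [0, 0, 1, 0, 1, 0]
print(limit_indices_per_class(toy_targets, max_per_class=2))  # [0, 1, 2, 4]
```

The returned indices could then be wrapped with `torch.utils.data.Subset` so that only the selected samples are ever loaded.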

# Memory-Efficient Binary Neural Network (BNN) for Plant Disease Classification - 224x224 Images

This notebook implements a Binary Neural Network using PyTorch for multiclass plant disease classification, optimized for handling 224x224 images on GPUs with limited memory.

## Memory Management Guide

This notebook is tuned for GPUs with limited memory (under 4 GB). The memory utilities cell sets `memory_mode` to "aggressive" by default to prevent CUDA out-of-memory errors.

### Current Memory-Optimized Settings:
- Image resolution: 224x224 (standard image size)
- Batch size: 4 (reduced for memory efficiency)
- Hidden size: 256 (compact model architecture)
- Progressive dimensionality reduction (multiple embedding steps)
- Gradient accumulation: 8 steps (effective batch size of 32)
- Mixed precision training (FP16)
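
The numbers in this list can be checked with a little arithmetic. The sketch below reproduces the effective batch size and estimates the size of the raw input tensor only; activations, weights, and optimizer state are not counted, so treat the result as a lower bound. The helper `input_bytes` is a hypothetical name used for illustration.

```python
batch_size = 4
accumulation_steps = 8
effective_batch_size = batch_size * accumulation_steps
print(effective_batch_size)  # 32

def input_bytes(batch, channels=3, height=224, width=224, bytes_per_value=4):
    """Size in bytes of one input batch (4 bytes per value for FP32, 2 for FP16)."""
    return batch * channels * height * width * bytes_per_value

fp32_mib = input_bytes(batch_size) / 1024**2
fp16_mib = input_bytes(batch_size, bytes_per_value=2) / 1024**2
print(f"{fp32_mib:.2f} MiB (FP32), {fp16_mib:.2f} MiB (FP16)")  # 2.30 MiB (FP32), 1.15 MiB (FP16)
```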

### Memory Monitoring
To monitor memory usage, call `print_gpu_memory_stats()` and `print_system_memory()`, both defined in the memory utilities cell of this notebook.
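
For measuring the peak memory of one specific step rather than the current totals, a small context manager can help. This is a minimal sketch, not part of the notebook's own utilities: `track_peak_gpu_memory` is a hypothetical name, and on a machine without CUDA it simply does nothing.

```python
import contextlib
import torch

@contextlib.contextmanager
def track_peak_gpu_memory(label):
    """Report the peak GPU memory allocated inside the with-block (no-op on CPU)."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
    result = {"peak_gb": None}
    try:
        yield result
    finally:
        if use_cuda:
            result["peak_gb"] = torch.cuda.max_memory_allocated() / (1024 ** 3)
            print(f"{label}: peak {result['peak_gb']:.3f} GB allocated")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
with track_peak_gpu_memory("dummy batch") as stats:
    x = torch.randn(4, 3, 224, 224, device=device)  # same shape as one 224x224 batch of 4
    y = (x * 2).sum()
```

Wrapping a single forward and backward pass this way shows which step is responsible for a memory spike.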

# Binary Neural Network (BNN) for Plant Disease Classification

This notebook implements a Binary Neural Network using PyTorch for multiclass plant disease classification. The BNN uses binary weights and activations to reduce model size and computational requirements while maintaining reasonable accuracy.

## Features:
- Binary weights and activations using sign function
- Processes 224x224 RGB images (standard size)
- Progressive dimensionality reduction for memory efficiency
- Binary hidden layers with binary weights
- Batch normalization for improved stability
- Dropout for regularization
- Learning rate scheduling for better convergence
- Multiclass output with softmax activation
- Mixed precision training
- Gradient accumulation
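
The sign function has zero gradient almost everywhere, so binarized layers need a surrogate gradient to train. A common choice is the straight-through estimator (STE); whether this notebook's model uses exactly this variant is an assumption, and `SignSTE` is a hypothetical name for the sketch below.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator for the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Map to {-1, +1}; zeros go to +1 so the output is strictly binary
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients through only where |x| <= 1 (hard-tanh style clipping)
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.tensor([-2.0, -0.5, 0.0, 0.7, 3.0], requires_grad=True)
y = SignSTE.apply(x)
y.sum().backward()
print(y.tolist())       # [-1.0, -1.0, 1.0, 1.0, 1.0]
print(x.grad.tolist())  # [0.0, 1.0, 1.0, 1.0, 0.0]
```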

In [1]:
# Import necessary libraries
import os
import time
import gc
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, random_split, Subset
import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder

# Check for CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

# Set random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    os.environ['PYTHONHASHSEED'] = str(seed)

set_seed()

# # Memory configuration - optimized for large images (224x224)
# memory_config = {
#     'batch_size': 1,  # Smaller batch size for larger images
#     'gradient_accumulation_steps': 4,  # Accumulate gradients over multiple batches
#     'gc_frequency': 5,  # Garbage collection frequency
#     'memory_efficient': True,  # Use memory-efficient techniques
#     'use_mixed_precision': True  # Use mixed precision training if available
# }

# Updated memory configuration with extreme optimization
memory_config = {
    'batch_size': 1,  # Absolute minimum
    'gradient_accumulation_steps': 8,  # Increase to compensate for tiny batch
    'gc_frequency': 1,  # Maximum GC
    'memory_efficient': True,
    'use_mixed_precision': True,
    'image_size': 128,  # Reduced image size (not 224; a later cell resets this to 224)
    'hidden_size': 128,  # Smaller model
    'embedding_size': 256,  # Smaller embeddings
    'max_samples_per_class': 50,  # Limit dataset size temporarily
    'num_workers': 0,  # No parallel data loading
    'pin_memory': False  # Disable pin memory
}

print("Applied extreme memory optimization settings")

# Use memory configuration
batch_size = memory_config['batch_size']
gradient_accumulation_steps = memory_config['gradient_accumulation_steps'] 
gc_frequency = memory_config['gc_frequency']
memory_efficient = memory_config['memory_efficient']
use_mixed_precision = memory_config['use_mixed_precision']
memory_mode = 'High Memory Optimization (224x224)'

print(f"Memory Configuration Mode: {memory_mode}")
print(f"Batch Size: {batch_size}")
print(f"Gradient Accumulation Steps: {gradient_accumulation_steps}")
print(f"Effective Batch Size: {batch_size * gradient_accumulation_steps}")
print(f"GC Frequency: {gc_frequency}")
print(f"Memory Efficient: {memory_efficient}")
print(f"Mixed Precision: {use_mixed_precision}")

Using device: cuda
PyTorch version: 2.7.1+cu126
Applied extreme memory optimization settings
Memory Configuration Mode: High Memory Optimization (224x224)
Batch Size: 1
Gradient Accumulation Steps: 8
Effective Batch Size: 8
GC Frequency: 1
Memory Efficient: True
Mixed Precision: True


In [2]:
# EXTREME MEMORY OPTIMIZATION
# Run this cell if you are still facing memory issues

# 1. Force CPU-only mode (completely bypass GPU)
# Uncomment this line to force CPU mode
# device = torch.device("cpu")
# print(f"FORCED CPU MODE: {device}")

# 2. More extreme memory config
extreme_memory_config = {
    'batch_size': 1,
    'gradient_accumulation_steps': 8,
    'gc_frequency': 1,
    'memory_efficient': True,
    'use_mixed_precision': False,  # Mixed precision might not work well on CPU
    'image_size': 64,  # Drastically reduced image size
    'hidden_size': 64,
    'embedding_size': 128,
    'max_samples_per_class': 20,  # Ultra-limited dataset size
    'num_workers': 0,
    'pin_memory': False,
    'num_hidden_layers': 1,  # Minimum layers
    'dropout_rate': 0.5,  # More aggressive dropout
}

# To use these extreme settings, uncomment the next line:
# memory_config = extreme_memory_config
# print("⚠️ EXTREME memory optimization settings applied!")

# 3. Enable model checkpointing
try:
    from torch.utils.checkpoint import checkpoint
    use_checkpointing = True
    print("Checkpointing available - will use to save memory")
except ImportError:
    use_checkpointing = False
    print("Checkpointing not available in this PyTorch version")

# 4. Empty CUDA cache more aggressively
torch.cuda.empty_cache()
gc.collect()

# 5. Display current GPU memory status
if torch.cuda.is_available():
    print(f"\nCurrent GPU memory usage:")
    print(f"- Allocated: {torch.cuda.memory_allocated() / (1024**3):.3f} GB")
    print(f"- Reserved:  {torch.cuda.memory_reserved() / (1024**3):.3f} GB")

Checkpointing available - will use to save memory

Current GPU memory usage:
- Allocated: 0.000 GB
- Reserved:  0.000 GB
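
The cell above only checks that checkpointing can be imported. A minimal sketch of how it might be applied to a block is shown here; `CheckpointedBlock` is a hypothetical module, not the notebook's model, and the `use_reentrant=False` argument assumes a reasonably recent PyTorch release.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Hypothetical block whose activations are recomputed during backward."""

    def __init__(self, features):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(features, features), nn.ReLU(),
            nn.Linear(features, features), nn.ReLU(),
        )

    def forward(self, x):
        if self.training:
            # Do not keep intermediate activations; recompute them in backward
            return checkpoint(self.body, x, use_reentrant=False)
        return self.body(x)

torch.manual_seed(0)
block = CheckpointedBlock(16)
x = torch.randn(2, 16)
out = block(x)
out.sum().backward()
print(out.shape)  # torch.Size([2, 16])
print(all(p.grad is not None for p in block.parameters()))  # True
```

Checkpointing trades extra compute for lower activation memory, so it tends to pay off most on the widest early layers.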


In [3]:
# 224x224-SPECIFIC MEMORY OPTIMIZATIONS
# This cell provides optimizations while keeping 224x224 image size

# 1. Update memory config for 224x224 specifically
memory_config_224 = {
    # Keep the 224x224 image size
    'image_size': 224,
    
    # Data loading optimizations
    'batch_size': 1,  # Absolute minimum
    'gradient_accumulation_steps': 32,  # Very high accumulation to compensate for batch=1
    'max_samples_per_class': 20,  # Very limited samples
    'num_workers': 0,
    'pin_memory': False,
    
    # Memory aggressive settings
    'gc_frequency': 1,  # Run GC after every batch
    'memory_efficient': True,
    'use_mixed_precision': True,
    
    # More conservative model architecture
    'hidden_size': 64,  # Drastically reduced
    'embedding_size': 128,  # Drastically reduced
    'num_hidden_layers': 1,
}

# Uncomment to apply 224x224 specific settings
# memory_config.update(memory_config_224)
# print("Applied 224x224-specific memory optimizations")

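# 2. Set PYTORCH_CUDA_ALLOC_CONF for more aggressive memory management.
# Note: the CUDA caching allocator reads this variable when it initializes, so it
# should be set before the first CUDA allocation (safest: before importing torch).
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:32'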

# 3. Enable gradient checkpointing (requires PyTorch >= 1.4)
USE_GRADIENT_CHECKPOINTING = True

# 4. Empty cache before starting
torch.cuda.empty_cache()
gc.collect()

# 5. Display current memory usage
if torch.cuda.is_available():
    print(f"Current GPU memory usage before optimizations:")
    print(f"- Allocated: {torch.cuda.memory_allocated() / (1024**3):.3f} GB")
    print(f"- Reserved: {torch.cuda.memory_reserved() / (1024**3):.3f} GB")

Current GPU memory usage before optimizations:
- Allocated: 0.000 GB
- Reserved: 0.000 GB
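
The allocator setting used above is a comma-separated list of `key:value` pairs. The sketch below builds that string from keyword arguments so the options are easier to vary between runs; `build_alloc_conf` is a hypothetical helper, and the assumption is that this code runs before any CUDA memory has been allocated.

```python
import os

def build_alloc_conf(**options):
    """Join allocator options into the 'key:value,key:value' form PyTorch expects."""
    return ",".join(f"{key}:{value}" for key, value in options.items())

conf = build_alloc_conf(expandable_segments=True, max_split_size_mb=32)
# Only set the variable if the user has not already chosen a value
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)
print(conf)  # expandable_segments:True,max_split_size_mb:32
```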


In [4]:
# Additional imports for enhanced visualization and data export
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import precision_recall_fscore_support
import time
import datetime
import os

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Create results directory
os.makedirs('results', exist_ok=True)
print("Results directory created: ./results/")

Results directory created: ./results/


In [5]:
def evaluate_bnn_memory_efficient(model, test_loader, criterion, device, class_names, batch_limit=None, use_mixed_precision=False):
    """
    Memory-efficient evaluation function for BNN model
    
    Args:
        model: The trained BNN model
        test_loader: DataLoader for the test dataset
        criterion: Loss function
        device: Device to run evaluation on
        class_names: List of class names
        batch_limit: Limit the number of batches to evaluate (for debugging)
        use_mixed_precision: Whether to use mixed precision
        
    Returns:
        Dictionary with evaluation metrics
    """
    # Import needed modules at the beginning of the function to avoid UnboundLocalError
    import torch
    import torch.nn.functional as F
    import time
    import contextlib
    
    model.eval()
    
    # Set up metrics
    test_loss = 0.0
    correct = 0
    total = 0
    
    # Initialize confusion matrix
    num_classes = len(class_names)
    confusion_matrix = torch.zeros(num_classes, num_classes)
    
    # Prepare for per-class metrics
    class_correct = [0] * num_classes
    class_total = [0] * num_classes
    
    # Set up for precision, recall, F1
    true_positives = [0] * num_classes
    false_positives = [0] * num_classes
    false_negatives = [0] * num_classes
    
    # Store some sample images for visualization
    sample_images = []
    sample_labels = []
    sample_preds = []
    samples_collected = 0
    max_samples = 10  # Maximum number of samples to collect
    
    # For ROC curve
    all_targets = []
    all_probs = []
    
    # Check if mixed precision is available (torch.autocast exists in PyTorch >= 1.10)
    mixed_precision_available = use_mixed_precision and hasattr(torch, 'autocast')
    
    # Use torch.autocast, which accepts device_type. The older torch.cuda.amp.autocast
    # does not take that argument, and the previous major/minor version check
    # misclassified PyTorch 2.x releases (minor version < 10).
    if mixed_precision_available:
        autocast_context = lambda: torch.autocast(device_type=device.type)
    else:
        autocast_context = contextlib.nullcontext

    # Track processing time
    start_time = time.time()
    
    # Disable gradient computation for evaluation
    with torch.no_grad():
        # Process batches
        for batch_idx, (data, targets) in enumerate(test_loader):
            # Announce the evaluation scope once, before any batch is processed
            if batch_idx == 0:
                if batch_limit is not None:
                    print(f"Evaluating model on {batch_limit} batches (limited)...")
                else:
                    print(f"Evaluating model on {len(test_loader)} batches (all)...")
            
            # Respect batch limit if specified
            if batch_limit is not None and batch_idx >= batch_limit:
                break
                
            # Move data to device
            data, targets = data.to(device), targets.to(device)
            
            # Use mixed precision if available
            if mixed_precision_available:
                with autocast_context():
                    outputs = model(data)
                    loss = criterion(outputs, targets)
            else:
                outputs = model(data)
                loss = criterion(outputs, targets)
                
            # Accumulate loss
            test_loss += loss.item()
            
            # Get predictions
            _, predicted = outputs.max(1)
            
            # Update metrics
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
            
            # Store probabilities for ROC curve (using softmax)
            probs = F.softmax(outputs, dim=1)
            all_targets.extend(targets.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())
            
            # Update confusion matrix
            for t, p in zip(targets.view(-1), predicted.view(-1)):
                confusion_matrix[t.long(), p.long()] += 1
                
            # Update per-class metrics
            for i in range(len(targets)):
                label = targets[i].item()
                pred = predicted[i].item()
                class_total[label] += 1
                if label == pred:
                    class_correct[label] += 1
                    
            # Update precision, recall metrics
            for c in range(num_classes):
                true_positives[c] += ((predicted == c) & (targets == c)).sum().item()
                false_positives[c] += ((predicted == c) & (targets != c)).sum().item()
                false_negatives[c] += ((predicted != c) & (targets == c)).sum().item()
                
            # Collect sample images for visualization
            if samples_collected < max_samples:
                # Get a few samples from this batch
                num_to_collect = min(max_samples - samples_collected, data.size(0))
                sample_images.extend(data[:num_to_collect].cpu())
                sample_labels.extend(targets[:num_to_collect].cpu().numpy())
                sample_preds.extend(predicted[:num_to_collect].cpu().numpy())
                samples_collected += num_to_collect
                
            # Clean up memory
            del data, targets, outputs, predicted
            
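    # Compute average loss over the batches actually evaluated
    # (dividing by len(test_loader) would understate the loss when batch_limit is set)
    num_batches = len(test_loader) if batch_limit is None else min(batch_limit, len(test_loader))
    test_loss /= max(1, num_batches)
    
    # Compute accuracy (guard against an empty evaluation set)
    accuracy = 100. * correct / max(1, total)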
    
    # Compute per-class accuracy
    class_accuracy = [100. * class_correct[i] / max(1, class_total[i]) for i in range(num_classes)]
    
    # Compute precision, recall, F1
    precision = [true_positives[i] / max(1, true_positives[i] + false_positives[i]) for i in range(num_classes)]
    recall = [true_positives[i] / max(1, true_positives[i] + false_negatives[i]) for i in range(num_classes)]
    f1_score = [2 * precision[i] * recall[i] / max(1e-6, precision[i] + recall[i]) for i in range(num_classes)]
    
    # Calculate macro averages
    macro_precision = sum(precision) / num_classes
    macro_recall = sum(recall) / num_classes
    macro_f1 = sum(f1_score) / num_classes
    
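    # Per-class fraction of correctly classified samples (diagonal / row sum);
    # clamp avoids 0/0 = NaN for classes with no samples in the evaluated batches
    confusion_percentage = confusion_matrix.diag() / confusion_matrix.sum(1).clamp(min=1)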
    
    # Track total processing time
    eval_time = time.time() - start_time
    
    # Return a comprehensive metrics dictionary
    metrics = {
        'loss': test_loss,
        'accuracy': accuracy,
        'class_accuracy': dict(zip(class_names, class_accuracy)),
        'precision': dict(zip(class_names, precision)),
        'recall': dict(zip(class_names, recall)),
        'f1_score': dict(zip(class_names, f1_score)),
        'macro_precision': macro_precision,
        'macro_recall': macro_recall,
        'macro_f1': macro_f1,
        'confusion_matrix': confusion_matrix.cpu().numpy(),
        'confusion_percentage': confusion_percentage.cpu().numpy(),
        'class_distribution': dict(zip(class_names, class_total)),
        'evaluation_time': eval_time,
        'samples': {
            'images': sample_images,
            'true_labels': sample_labels,
            'predicted_labels': sample_preds,
        },
        'roc_data': {
            'targets': all_targets,
            'probs': all_probs,
        }
    }
    
    return metrics
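
The precision, recall, and F1 values computed in the function above can also be derived directly from a confusion matrix. The sketch below works through a toy 2-class case in plain Python using the same guards against division by zero; `per_class_metrics` is a hypothetical helper and the matrix values are made up for illustration.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F1 per class from a square confusion matrix.

    confusion[i][j] counts samples with true class i predicted as class j.
    """
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(confusion[c]) - tp                        # true c, predicted class differs
        precision = tp / max(1, tp + fp)
        recall = tp / max(1, tp + fn)
        f1 = 2 * precision * recall / max(1e-6, precision + recall)
        results.append((precision, recall, f1))
    return results

# Toy 2-class example: rows are true labels, columns are predictions
toy = [[8, 2],
       [1, 9]]
for name, (p, r, f) in zip(["class_0", "class_1"], per_class_metrics(toy)):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```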

In [6]:
# Memory-Efficient Training Function with Version-Compatible Autocast
def train_memory_efficient(model, train_loader, criterion, optimizer, num_epochs, device,
                          scheduler=None, gradient_accumulation_steps=1, memory_efficient=True,
                          gc_frequency=10, use_mixed_precision=False, 
                          early_stopping_patience=None):
    """
    Memory-efficient training function for BNN model
    
    Args:
        model: The BNN model
        train_loader: DataLoader for the training dataset
        criterion: Loss function
        optimizer: Optimizer for training
        num_epochs: Number of training epochs
        device: Device to train on (cpu or cuda)
        scheduler: Learning rate scheduler (optional)
        gradient_accumulation_steps: Number of steps to accumulate gradients
        memory_efficient: Whether to use memory efficiency techniques
        gc_frequency: How often to perform garbage collection
        use_mixed_precision: Whether to use mixed precision training
        early_stopping_patience: Patience for early stopping (optional)
        
    Returns:
        Dictionary with training metrics
    """
    # Import torch at the beginning of the function to avoid UnboundLocalError
    import torch
    import time
    import gc
    import contextlib
    
    # Training history
    history = []
    train_losses = []
    train_accuracies = []
    
    # For early stopping
    best_loss = float('inf')
    patience_counter = 0
    
    # For timing
    epoch_times = []
    
    # Mixed precision setup (torch.autocast exists in PyTorch >= 1.10)
    mixed_precision_available = use_mixed_precision and hasattr(torch, 'autocast')
    
    if mixed_precision_available:
        # Gradient scaling only matters for CUDA float16 autocast; disable it elsewhere
        use_scaler = (device.type == 'cuda')
        try:
            scaler = torch.amp.GradScaler('cuda', enabled=use_scaler)  # newer PyTorch releases
        except (AttributeError, TypeError):
            from torch.cuda.amp import GradScaler  # fallback for older releases
            scaler = GradScaler(enabled=use_scaler)
        # Use torch.autocast, which accepts device_type. The older torch.cuda.amp.autocast
        # does not take that argument, and the previous major/minor version check
        # misclassified PyTorch 2.x releases (minor version < 10).
        autocast_context = lambda: torch.autocast(device_type=device.type)
    else:
        scaler = None
        autocast_context = contextlib.nullcontext
    
    # Main training loop
    print(f"Starting training for {num_epochs} epochs with mixed precision: {mixed_precision_available}")
    
    for epoch in range(num_epochs):
        epoch_start_time = time.time()
        
        model.train()
        running_loss = 0.0
        running_corrects = 0
        total_samples = 0
        
        # Reset gradients at the start of each epoch for consistent behavior
        optimizer.zero_grad()
        
        # Process batches
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            # Move data to device
            inputs, labels = inputs.to(device), labels.to(device)
            
            # Forward pass with mixed precision if available
            if mixed_precision_available:
                with autocast_context():
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    
                    # Adjust loss for gradient accumulation
                    loss = loss / gradient_accumulation_steps
                
                # Backward pass with gradient scaling
                scaler.scale(loss).backward()
                
                # Step with gradient accumulation
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    scaler.step(optimizer)
                    scaler.update()
                    optimizer.zero_grad()
            else:
                # Standard forward pass
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                
                # Adjust loss for gradient accumulation
                loss = loss / gradient_accumulation_steps
                
                # Standard backward pass
                loss.backward()
                
                # Step with gradient accumulation
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    optimizer.step()
                    optimizer.zero_grad()
            
            # Calculate metrics
            _, preds = torch.max(outputs, 1)
            running_loss += loss.item() * gradient_accumulation_steps  # Rescale loss for reporting
            running_corrects += torch.sum(preds == labels.data).item()
            total_samples += labels.size(0)
            
            # Memory cleanup
            if memory_efficient and (batch_idx + 1) % gc_frequency == 0:
                del inputs, labels, outputs, preds, loss
                torch.cuda.empty_cache()
                gc.collect()
                
        # Apply any leftover accumulated gradients when the number of batches
        # is not divisible by gradient_accumulation_steps
        if len(train_loader) % gradient_accumulation_steps != 0:
            if mixed_precision_available:
                scaler.step(optimizer)
                scaler.update()
            else:
                optimizer.step()
            optimizer.zero_grad()
            
        # Calculate epoch metrics
        epoch_loss = running_loss / len(train_loader)
        epoch_acc = running_corrects / total_samples * 100.0
        
        # Step scheduler if provided
        if scheduler is not None:
            if isinstance(scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
                scheduler.step(epoch_loss)
            else:
                scheduler.step()
            
        # Record metrics
        train_losses.append(epoch_loss)
        train_accuracies.append(epoch_acc)
        
        # Record epoch time
        epoch_end_time = time.time()
        epoch_time = epoch_end_time - epoch_start_time
        epoch_times.append(epoch_time)
        
        # Print epoch results
        print(f'Epoch {epoch+1}/{num_epochs} - Loss: {epoch_loss:.4f}, Acc: {epoch_acc:.2f}%, Time: {epoch_time:.2f}s')
        
        # Save epoch history
        history.append({
            'epoch': epoch + 1,
            'loss': epoch_loss,
            'accuracy': epoch_acc,
            'time': epoch_time,
            'learning_rate': optimizer.param_groups[0]['lr']
        })
        
        # Memory cleanup at the end of epoch
        if memory_efficient:
            torch.cuda.empty_cache()
            gc.collect()
            
        # Early stopping check
        if early_stopping_patience is not None:
            if epoch_loss < best_loss:
                best_loss = epoch_loss
                patience_counter = 0
            else:
                patience_counter += 1
                
            if patience_counter >= early_stopping_patience:
                print(f'Early stopping at epoch {epoch+1}')
                break
                
    # Return training metrics
    training_summary = {
        'history': history,
        'train_losses': train_losses,
        'train_accuracies': train_accuracies,
        'epoch_times': epoch_times,
        'total_time': sum(epoch_times),
        'final_loss': train_losses[-1],
        'final_accuracy': train_accuracies[-1]
    }
    
    return training_summary
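
The training function divides each loss by `gradient_accumulation_steps` before calling `backward()`. The sketch below checks on a toy linear model that this reproduces the full-batch gradient when the micro-batches are equally sized and the loss uses mean reduction; the model and data here are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()  # mean reduction over the batch
inputs = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))

# Reference: gradient from the full batch of 8
model.zero_grad()
criterion(model(inputs), labels).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 2, each loss divided by the number of steps
accumulation_steps = 4
model.zero_grad()
for micro_inputs, micro_labels in zip(inputs.chunk(accumulation_steps),
                                      labels.chunk(accumulation_steps)):
    loss = criterion(model(micro_inputs), micro_labels) / accumulation_steps
    loss.backward()  # gradients add up in .grad across calls

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True
```

Note that layers such as batch normalization compute statistics per micro-batch, so with them accumulation is not exactly equivalent to a larger batch.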

In [7]:
# Memory Monitoring and Optimization Utilities
import gc
import psutil

def print_gpu_memory_stats():
    """Print detailed GPU memory statistics"""
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        allocated = torch.cuda.memory_allocated() / (1024 ** 3)
        reserved = torch.cuda.memory_reserved() / (1024 ** 3)
        max_allocated = torch.cuda.max_memory_allocated() / (1024 ** 3)
        
        print(f"Memory allocated: {allocated:.2f} GB")
        print(f"Memory reserved: {reserved:.2f} GB")
        print(f"Max memory allocated: {max_allocated:.2f} GB")
        
        if hasattr(torch.cuda, 'memory_summary'):
            print("\nMemory Summary:")
            print(torch.cuda.memory_summary(abbreviated=True))
    else:
        print("CUDA not available")

def print_system_memory():
    """Print system memory usage"""
    vm = psutil.virtual_memory()
    print(f"System memory: {vm.total / (1024**3):.1f} GB total, " 
          f"{vm.available / (1024**3):.1f} GB available, "
          f"{vm.percent}% used")

def optimize_memory(mode='aggressive'):
    """Apply memory optimization settings based on selected mode"""
    if mode == 'aggressive':
        # Most aggressive memory saving settings for 224x224 images
        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:32'
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
        return {
            'image_size': 224,  # 224x224 for this notebook
            'batch_size': 4,  # Smaller batch size for larger images
            'hidden_size': 256,
            'embedding_size': 512,
            'num_hidden_layers': 1,
            'gradient_accumulation': 8,
            'max_samples_per_class': 100  # Limit samples for 224x224 images
        }
    elif mode == 'moderate':
        # Balanced memory saving
        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:64'
        torch.backends.cudnn.benchmark = True
        return {
            'image_size': 224,  # 224x224 for this notebook
            'batch_size': 6,
            'hidden_size': 320,
            'embedding_size': 640,
            'num_hidden_layers': 1,
            'gradient_accumulation': 4,
            'max_samples_per_class': 150
        }
    else:  # 'performance' mode
        # Optimized for performance, higher memory usage
        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
        torch.backends.cudnn.benchmark = True
        return {
            'image_size': 224,  # 224x224 for this notebook
            'batch_size': 8,
            'hidden_size': 512,
            'embedding_size': 1024,
            'num_hidden_layers': 2,
            'gradient_accumulation': 2,
            'max_samples_per_class': None
        }

# Check current memory status
print("Initial memory status:")
print_system_memory()
print_gpu_memory_stats()

# Memory optimization mode - set this to 'performance', 'moderate', or 'aggressive'
memory_mode = 'aggressive'  # Using aggressive mode to avoid CUDA OOM errors
print(f"\nUsing {memory_mode} memory optimization settings")
memory_config = optimize_memory(memory_mode)
print(f"Recommended settings: {memory_config}")

# Create results directory
os.makedirs('results', exist_ok=True)
print("\nResults directory created: ./results/")

Initial memory status:
System memory: 15.0 GB total, 9.0 GB available, 39.6% used
GPU: NVIDIA GeForce GTX 1650
Memory allocated: 0.00 GB
Memory reserved: 0.00 GB
Max memory allocated: 0.00 GB

Memory Summary:
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Requested memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|----------

In [8]:
# Set Dataset Path
# This is the path to your plant disease dataset
# You need to modify this path to point to your actual dataset location
# The dataset should have a folder structure like:
# - dataset_path/Healthy_Soyabean/
# - dataset_path/rust/
# - dataset_path/Soyabean_Mosaic/
# or similar class folders

# dataset_path = "/home/dragoon/Downloads/MH-SoyaHealthVision An Indian UAV and Leaf Image Dataset for Integrated Crop Health Assessment/Soyabean_UAV-Based_Image_Dataset" # MODIFY THIS PATH to your dataset location
dataset_path = "/home/dragoon/Downloads/MH-SoyaHealthVision An Indian UAV and Leaf Image Dataset for Integrated Crop Health Assessment/dataset" # MODIFY THIS PATH to your dataset location

print(f"Dataset path set to: {dataset_path}")
print("Make sure this path contains your plant disease class folders")

# Check if the path exists
import os
if os.path.exists(dataset_path):
    print("✅ Path exists!")
    # List the folders (classes) in the dataset
    class_folders = [f for f in os.listdir(dataset_path) if os.path.isdir(os.path.join(dataset_path, f))]
    print(f"Found {len(class_folders)} class folders: {class_folders}")
else:
    print("❌ Path does not exist! Please update the dataset_path variable to your actual dataset location.")

Dataset path set to: /home/dragoon/Downloads/MH-SoyaHealthVision An Indian UAV and Leaf Image Dataset for Integrated Crop Health Assessment/dataset
Make sure this path contains your plant disease class folders
✅ Path exists!
Found 4 class folders: ['Soyabean Semilooper_Pest_Attack', 'rust', 'Healthy_Soyabean', 'Soyabean_Mosaic']


In [9]:
# Load Plant Disease Dataset
from torchvision import datasets, transforms
from PIL import Image
import os

def load_plant_disease_dataset(dataset_path, image_size=224, memory_efficient=True, max_samples_per_class=None):
    """
    Load the plant disease dataset from the specified path with memory optimization options.
    
    Parameters:
    -----------
    dataset_path : str
        Path to dataset directory
    image_size : int
        Size to resize images (default: 224)
    memory_efficient : bool
        If True, loads dataset with memory optimization (returns DataLoader instead of tensors)
    max_samples_per_class : int or None
        If specified, limits the number of samples per class (for testing with less memory)
    
    Dataset structure:
    - dataset_path/Healthy_Soyabean/
    - dataset_path/rust/
    - dataset_path/Soyabean_Mosaic/
    """
    
    # Define transforms for preprocessing
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size)),  # Resize to specified size
        transforms.ToTensor(),  # Convert to tensor and normalize to [0,1]
        transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                           std=[0.229, 0.224, 0.225])  # ImageNet normalization
    ])
    
    # Load dataset using ImageFolder
    dataset = datasets.ImageFolder(root=dataset_path, transform=transform)
    
    # Limit samples per class if specified (for memory-constrained environments)
    if max_samples_per_class is not None:
        class_indices = {}
        for idx, (_, class_idx) in enumerate(dataset.samples):
            if class_idx not in class_indices:
                class_indices[class_idx] = []
            if len(class_indices[class_idx]) < max_samples_per_class:
                class_indices[class_idx].append(idx)
        
        # Flatten indices list
        limited_indices = [idx for indices in class_indices.values() for idx in indices]
        
        # Create subset
        from torch.utils.data import Subset
        dataset = Subset(dataset, limited_indices)
        print(f"Dataset limited to {max_samples_per_class} samples per class ({len(limited_indices)} total samples)")
    
    # Print class mapping
    print("Class mapping:")
    for idx, class_name in enumerate(dataset.dataset.classes if hasattr(dataset, 'dataset') else dataset.classes):
        print(f"  {idx}: {class_name}")
    
    # Return dataset directly for memory-efficient use
    if memory_efficient:
        # Count samples per class for data statistics
        class_counts = {}
        if hasattr(dataset, 'dataset'):  # If it's a Subset
            for _, class_idx in [dataset.dataset.samples[i] for i in dataset.indices]:
                class_counts[class_idx] = class_counts.get(class_idx, 0) + 1
            classes = dataset.dataset.classes
        else:
            for _, class_idx in dataset.samples:
                class_counts[class_idx] = class_counts.get(class_idx, 0) + 1
            classes = dataset.classes
        
        # Display class distribution
        print(f"\nClass distribution:")
        for class_idx, count in class_counts.items():
            class_name = classes[class_idx]
            print(f"  {class_name}: {count} images")
        
        return dataset, classes
    else:
        # Convert to tensors (this loads ALL images into memory)
        data_loader = torch.utils.data.DataLoader(dataset, batch_size=len(dataset), shuffle=False)
        X, y = next(iter(data_loader))
        
        # Get class names from dataset
        if hasattr(dataset, 'dataset'):
            class_names = dataset.dataset.classes
        else:
            class_names = dataset.classes
        
        return X, y, class_names

# Set memory optimization options from memory config
memory_efficient = True  # Load dataset in memory-efficient way
max_samples_per_class = memory_config['max_samples_per_class']  # From memory config

# Image size settings from memory config
image_size = memory_config.get('image_size', 224)
print(f"Image size set to {image_size}x{image_size} (from memory config)")

# Load your plant disease dataset
# Using the dataset_path defined in the previous cell
print("Loading plant disease dataset...")
print(f"Dataset path: {dataset_path}")

if memory_efficient:
    # Memory-efficient loading (doesn't load all images at once)
    plant_dataset, class_names = load_plant_disease_dataset(
        dataset_path, 
        image_size=image_size, 
        memory_efficient=memory_efficient,
        max_samples_per_class=max_samples_per_class
    )
    
    print(f"\nDataset loaded successfully in memory-efficient mode!")
    print(f"Total samples: {len(plant_dataset)}")
    print(f"Number of classes: {len(class_names)}")
    print(f"Class names: {class_names}")
else:
    # Original approach (loads all into memory at once)
    X, y, class_names = load_plant_disease_dataset(
        dataset_path, 
        image_size=image_size, 
        memory_efficient=False,
        max_samples_per_class=max_samples_per_class
    )
    
    print(f"\nDataset loaded successfully!")
    print(f"Dataset shape: {X.shape}")
    print(f"Labels shape: {y.shape}")
    print(f"Number of classes: {len(torch.unique(y))}")
    print(f"Class distribution: {torch.bincount(y)}")
    print(f"Class names: {class_names}")

Image size set to 224x224 (from memory config)
Loading plant disease dataset...
Dataset path: /home/dragoon/Downloads/MH-SoyaHealthVision An Indian UAV and Leaf Image Dataset for Integrated Crop Health Assessment/dataset
Dataset limited to 100 samples per class (169 total samples)
Class mapping:
  0: Healthy_Soyabean
  1: Soyabean Semilooper_Pest_Attack
  2: Soyabean_Mosaic
  3: rust

Class distribution:
  Healthy_Soyabean: 49 images
  Soyabean Semilooper_Pest_Attack: 40 images
  Soyabean_Mosaic: 40 images
  rust: 40 images

Dataset loaded successfully in memory-efficient mode!
Total samples: 169
Number of classes: 4
Class names: ['Healthy_Soyabean', 'Soyabean Semilooper_Pest_Attack', 'Soyabean_Mosaic', 'rust']


In [10]:
# Split Dataset and Create Data Loaders
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Define batch size from memory_config for consistency
batch_size = memory_config['batch_size']  # Use memory-optimized batch size
print(f"Using batch size {batch_size} for {image_size}x{image_size} images")

if memory_efficient:
    # Create train/test split indices
    from torch.utils.data import random_split
    
    # Use fixed random seed for reproducibility
    torch.manual_seed(42)
    
    # Get dataset size
    dataset_size = len(plant_dataset)
    test_size = int(0.2 * dataset_size)
    train_size = dataset_size - test_size
    
    # Create random train/test split
    train_dataset, test_dataset = random_split(plant_dataset, [train_size, test_size])
    
    # Create data loaders with a small batch size for memory efficiency
    train_loader = DataLoader(
        train_dataset, 
        batch_size=batch_size, 
        shuffle=True,
        num_workers=0,  # Increase if your system allows
        pin_memory=False  # Disabled to save host memory; True speeds up GPU transfer
    )
    
    test_loader = DataLoader(
        test_dataset, 
        batch_size=batch_size, 
        shuffle=False,  # Evaluation does not need shuffling
        num_workers=0,  # Increase if your system allows
        pin_memory=False
    )
    
    # Print info about data loaders
    print(f"Training set: {len(train_dataset)} samples")
    print(f"Test set: {len(test_dataset)} samples")
    print(f"Number of training batches: {len(train_loader)}")
    print(f"Number of test batches: {len(test_loader)}")
    
    # Store for later reference
    dataset_size = {'train': len(train_dataset), 'test': len(test_dataset)}
    
else:
    # Original approach with pre-loaded tensors
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"Training set: {X_train.shape[0]} samples")
    print(f"Test set: {X_test.shape[0]} samples")

    # Create data loaders
    train_dataset = TensorDataset(X_train, y_train)
    test_dataset = TensorDataset(X_test, y_test)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    print(f"Number of training batches: {len(train_loader)}")
    print(f"Number of test batches: {len(test_loader)}")
    
    # Store for later reference
    dataset_size = {'train': len(X_train), 'test': len(X_test)}

Using batch size 4 for 224x224 images
Training set: 136 samples
Test set: 33 samples
Number of training batches: 34
Number of test batches: 9
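
`random_split` draws indices uniformly, so with only 40 to 49 images per class the 33-image test set is not guaranteed to keep the class proportions (the tensor branch in the cell above does stratify). A minimal pure-Python sketch of a stratified index split follows; `stratified_split_indices` is a hypothetical helper that is not part of the notebook, and the label list is synthetic.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_fraction=0.2, seed=42):
    """Split sample indices so each class contributes about test_fraction to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local RNG keeps the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = int(round(test_fraction * len(indices)))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Synthetic labels mirroring the class counts printed earlier (49/40/40/40)
labels = [0] * 49 + [1] * 40 + [2] * 40 + [3] * 40
train_idx, test_idx = stratified_split_indices(labels)
print(len(train_idx), len(test_idx))
```

The two index lists could then be wrapped with `torch.utils.data.Subset`; for a dataset that is already a `Subset`, the labels would first have to be gathered from the underlying `samples` list.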


In [11]:
# Binary Activation Function
class BinaryActivation(torch.autograd.Function):
    """
    Binary activation function using the sign function.
    Forward: sign(x) = -1 if x < 0, +1 if x > 0 (torch.sign returns 0 for x == 0)
    Backward: Straight-through estimator (STE) - passes gradients through unchanged
    """
    
    @staticmethod
    def forward(ctx, input):
        # Apply sign function: -1 for negative, +1 for positive, 0 for exact zeros
        return torch.sign(input)
    
    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients through unchanged
        # This allows gradients to flow back during training
        return grad_output

def binary_activation(x):
    """Wrapper function for binary activation"""
    return BinaryActivation.apply(x)
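
The estimator above passes every gradient through unchanged. A variant often used for binary networks also zeroes the gradient where |x| > 1, which corresponds to backpropagating through a hard tanh. The sketch below shows that swapped-in variant, not the notebook's method; `ClippedBinaryActivation` is a hypothetical name.

```python
import torch

class ClippedBinaryActivation(torch.autograd.Function):
    """Sign in the forward pass; gradient passed only where |x| <= 1."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Block gradients for saturated inputs, pass the rest unchanged
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = ClippedBinaryActivation.apply(x)
y.sum().backward()
print(y.detach(), x.grad)
```

Whether clipping helps depends on the model and data; it is shown here only as an alternative to compare against.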

In [12]:
# Binary Linear Layer
class BinaryLinear(nn.Module):
    """
    Binary Linear layer with binary weights.
    Weights are binarized using the sign function during forward pass.
    """
    
    def __init__(self, in_features, out_features, bias=True):
        super(BinaryLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        
        # Initialize weights using normal distribution
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)
    
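    def forward(self, input):
        # Binarize weights with the straight-through estimator defined above.
        # Plain torch.sign has zero gradient almost everywhere, which would
        # leave self.weight untrained.
        binary_weight = binary_activation(self.weight)
        
        # Perform linear transformation with binary weights
        output = F.linear(input, binary_weight, self.bias)
        
        return output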
    
    def extra_repr(self):
        return f'in_features={self.in_features}, out_features={self.out_features}, bias={self.bias is not None}'
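
Latent real-valued weights only learn if the binarization step lets gradients through, because the derivative of `torch.sign` is zero almost everywhere. The check below is a sketch that compares the gradient reaching a weight tensor through plain `torch.sign` and through a straight-through estimator; `SignSTE` is a hypothetical stand-in and the inputs are random.

```python
import torch
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Sign forward, identity backward (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

torch.manual_seed(0)
x = torch.randn(4, 8)
w_plain = (torch.randn(3, 8) * 0.1).requires_grad_()
w_ste = w_plain.detach().clone().requires_grad_()

# Same linear layer, two ways of binarizing the weights
F.linear(x, torch.sign(w_plain)).sum().backward()
F.linear(x, SignSTE.apply(w_ste)).sum().backward()

plain_norm = 0.0 if w_plain.grad is None else w_plain.grad.norm().item()
print("plain sign grad norm:", plain_norm)
print("STE grad norm:", w_ste.grad.norm().item())
```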

In [13]:
# Binary Neural Network Model
class BinaryNeuralNetwork(nn.Module):
    """
    Memory-efficient Binary Neural Network for multiclass plant disease classification.
    
    Architecture:
    - Input: Flattened RGB images (3*image_size*image_size features)
    - Progressive dimensionality reduction: Multiple steps to reduce memory usage
    - Feature embedding: Further reduces dimensionality for memory efficiency 
    - Binary Hidden Layers: Binary linear layers with binary activation
    - Output Layer: Regular linear layer for class logits
    - Final: Softmax for multiclass prediction
    """
    
    def __init__(self, input_size=3*224*224, hidden_size=256, num_classes=3, 
                 num_hidden_layers=1, embedding_size=512, dropout_rate=0.2):
        super(BinaryNeuralNetwork, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.num_hidden_layers = num_hidden_layers
        self.embedding_size = embedding_size
        
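        # Calculate capped intermediate sizes for progressive dimension reduction.
        # Note: the first Linear layer still holds input_size * intermediate_size1 weights
        # (about 2.47 billion for a 224x224 RGB input), so this fully connected model
        # only fits in limited GPU memory at small image sizes.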
        intermediate_size1 = min(input_size // 8, 16384)  # More aggressive first reduction
        intermediate_size2 = min(intermediate_size1 // 4, 4096)  # Second reduction
        intermediate_size3 = min(intermediate_size2 // 2, 2048)  # Additional reduction
        intermediate_size4 = min(intermediate_size3 // 2, 1024)  # Final intermediate step before embedding
        
        print(f"Progressive dimension reduction: {input_size} → {intermediate_size1} → {intermediate_size2} → {intermediate_size3} → {intermediate_size4} → {embedding_size}")
        
        # Initial dimensionality reduction for memory efficiency - progressive steps with more stages
        self.embedding = nn.Sequential(
            # First reduction
            nn.Linear(input_size, intermediate_size1),
            nn.ReLU(),
            nn.BatchNorm1d(intermediate_size1),
            nn.Dropout(dropout_rate),
            
            # Second reduction
            nn.Linear(intermediate_size1, intermediate_size2),
            nn.ReLU(),
            nn.BatchNorm1d(intermediate_size2),
            nn.Dropout(dropout_rate),
            
            # Third reduction
            nn.Linear(intermediate_size2, intermediate_size3),
            nn.ReLU(),
            nn.BatchNorm1d(intermediate_size3),
            nn.Dropout(dropout_rate),
            
            # Fourth reduction (added for more gradual reduction)
            nn.Linear(intermediate_size3, intermediate_size4),
            nn.ReLU(),
            nn.BatchNorm1d(intermediate_size4),
            nn.Dropout(dropout_rate),
            
            # Final embedding
            nn.Linear(intermediate_size4, embedding_size),
            nn.ReLU(),
            nn.BatchNorm1d(embedding_size),
        )
        
        # First binary layer after embedding
        self.input_binary = BinaryLinear(embedding_size, hidden_size)
        
        # Multiple hidden binary layers
        self.hidden_layers = nn.ModuleList([
            BinaryLinear(hidden_size, hidden_size) for _ in range(num_hidden_layers)
        ])
        
        # Output layer: Regular linear layer for final classification
        self.output_layer = nn.Linear(hidden_size, num_classes)
        
        # Batch normalization for better training stability
        self.batch_norms = nn.ModuleList([
            nn.BatchNorm1d(hidden_size) for _ in range(num_hidden_layers + 1)  # +1 for input binary layer
        ])
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout_rate)
        
    def forward(self, x):
        # Flatten input if it's not already flattened
        if len(x.shape) > 2:
            x = x.view(x.size(0), -1)  # Flatten to (batch_size, input_size)
        
        # Initial embedding to reduce dimensionality (memory efficient)
        x = self.embedding(x)
        
        # First binary layer
        x = self.input_binary(x)
        x = binary_activation(x)
        x = self.batch_norms[0](x)
        x = self.dropout(x)
        
        # Process through additional hidden binary layers
        for i in range(self.num_hidden_layers):
            x = self.hidden_layers[i](x)
            x = binary_activation(x)  # Binary activation function
            x = self.batch_norms[i+1](x)  # Apply batch normalization
            x = self.dropout(x)
        
        # Output layer (no activation - raw logits)
        logits = self.output_layer(x)
        
        return logits
    
    def predict_proba(self, x):
        """Get class probabilities using softmax"""
        with torch.no_grad():
            logits = self.forward(x)
            probabilities = F.softmax(logits, dim=1)
        return probabilities
    
    def predict(self, x):
        """Get predicted class labels"""
        with torch.no_grad():
            logits = self.forward(x)
            predictions = torch.argmax(logits, dim=1)
        return predictions
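
The size of the fully connected embedding can be estimated before any tensor is allocated. The sketch below repeats the `min(...)` arithmetic from `__init__` for a 224x224 RGB input and counts the weights and biases of each `nn.Linear` in the embedding. `embedding_linear_params` is a hypothetical helper, and the byte figure assumes float32 parameters and ignores gradients, optimizer state and activations.

```python
def embedding_linear_params(input_size=3 * 224 * 224, embedding_size=512):
    """Count Linear weights and biases along the progressive reduction path."""
    s1 = min(input_size // 8, 16384)
    s2 = min(s1 // 4, 4096)
    s3 = min(s2 // 2, 2048)
    s4 = min(s3 // 2, 1024)
    sizes = [input_size, s1, s2, s3, s4, embedding_size]
    per_layer = [fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:])]
    return sizes, per_layer

sizes, per_layer = embedding_linear_params()
total = sum(per_layer)
print("layer sizes:", sizes)
print(f"first Linear: {per_layer[0]:,} parameters")
print(f"all Linear layers: {total:,} parameters, about {total * 4 / 1024**3:.2f} GiB in float32")
```

For 224x224 inputs the first layer alone holds roughly 2.47 billion weights, which is more than a 4 GB GPU can store in float32. Calling the helper with `input_size=3 * 64 * 64` shows how quickly the count drops at smaller resolutions.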

In [14]:
# Memory-Optimized Binary Neural Network Model with Checkpointing
class BinaryNeuralNetworkCheckpointed(nn.Module):
    """
    Ultra memory-efficient Binary Neural Network using checkpointing and simplified architecture.
    """
    def __init__(self, input_size=3*64*64, hidden_size=64, num_classes=3, dropout_rate=0.3):
        super(BinaryNeuralNetworkCheckpointed, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        
        # Simplified embedding with fewer stages to reduce memory
        embedding_size = hidden_size * 2
        intermediate_size1 = min(input_size // 4, 2048) 
        
        # Create separate embedding layers for checkpointing
        self.embed1 = nn.Sequential(
            nn.Linear(input_size, intermediate_size1),
            nn.ReLU(),
            nn.BatchNorm1d(intermediate_size1),
            nn.Dropout(dropout_rate)
        )
        
        self.embed2 = nn.Sequential(
            nn.Linear(intermediate_size1, embedding_size),
            nn.ReLU(),
            nn.BatchNorm1d(embedding_size)
        )
        
        # Binary layers
        self.binary_layer = BinaryLinear(embedding_size, hidden_size)
        self.bn = nn.BatchNorm1d(hidden_size)
        self.dropout = nn.Dropout(dropout_rate)
        
        # Output layer
        self.output_layer = nn.Linear(hidden_size, num_classes)
        
        print(f"Memory-optimized BNN created with input={input_size}, hidden={hidden_size}, embedding={embedding_size}")
        
    def forward(self, x):
        # Flatten input
        if len(x.shape) > 2:
            x = x.view(x.size(0), -1)
            
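        # Use checkpointing during training if the global flag is set.
        # use_reentrant=False keeps gradients flowing to embed1 even though
        # the input tensor itself does not require grad.
        if self.training and globals().get('use_checkpointing', False):
            from torch.utils.checkpoint import checkpoint
            x = checkpoint(self.embed1, x, use_reentrant=False)
            x = checkpoint(self.embed2, x, use_reentrant=False)
        else:
            x = self.embed1(x)
            x = self.embed2(x)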
        
        # Binary layer (no checkpointing needed, it's small)
        x = self.binary_layer(x)
        x = binary_activation(x)
        x = self.bn(x)
        x = self.dropout(x)
        
        # Output layer
        logits = self.output_layer(x)
        return logits
    
    def predict_proba(self, x):
        with torch.no_grad():
            logits = self.forward(x)
            probabilities = F.softmax(logits, dim=1)
        return probabilities
    
    def predict(self, x):
        with torch.no_grad():
            logits = self.forward(x)
            predictions = torch.argmax(logits, dim=1)
        return predictions
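
Checkpointing trades compute for memory: activations inside the checkpointed block are discarded after the forward pass and recomputed during backward, so the gradients should match the plain computation. The sketch below checks this on a small stand-in block (not the notebook's model) and assumes a PyTorch version whose `checkpoint` accepts the `use_reentrant` argument.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(8, 32)

# Reference gradients without checkpointing
block(x).sum().backward()
reference = [p.grad.clone() for p in block.parameters()]
block.zero_grad()

# Same computation, with activations recomputed during backward
checkpoint(block, x, use_reentrant=False).sum().backward()
checkpointed = [p.grad.clone() for p in block.parameters()]

grads_match = all(torch.allclose(a, b) for a, b in zip(reference, checkpointed))
print("gradients match:", grads_match)
```

The block here has no dropout, so both passes are deterministic; with dropout, `checkpoint` preserves the random number generator state by default so that the recomputation uses the same mask.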

In [15]:
# Ultra Memory-Efficient BNN Model for 224x224 Images
class BNN224x224MemoryOptimized(nn.Module):
    """
    Memory-optimized BNN specifically designed for 224x224 images
    using advanced techniques to minimize memory usage.
    """
    def __init__(self, input_size=3*224*224, hidden_size=64, num_classes=3, dropout_rate=0.5):
        super(BNN224x224MemoryOptimized, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        
        # Import torch utils for checkpointing
        from torch.utils.checkpoint import checkpoint_sequential
        
        # STRATEGY 1: Use Convolutional layers instead of fully connected for dimensionality reduction
        # This drastically reduces parameters vs. flattening to 150528 (3*224*224) dimensions
        self.conv_reducer = nn.Sequential(
            # Conv block 1 - reduce to 112x112
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.MaxPool2d(2),  # 224 -> 112
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(8),
            
            # Conv block 2 - reduce to 56x56
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.MaxPool2d(2),  # 112 -> 56
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(16),
            
            # Conv block 3 - reduce to 28x28
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.MaxPool2d(2),  # 56 -> 28
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(32),
            
            # Conv block 4 - reduce to 14x14
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.MaxPool2d(2),  # 28 -> 14
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(32),
            
            # Conv block 5 - reduce to 7x7
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.MaxPool2d(2),  # 14 -> 7
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(64),
            
            # Final feature reduction
            nn.AdaptiveAvgPool2d((1, 1)),  # Global average pooling: 7x7 -> 1x1
        )
        
        # STRATEGY 2: Keep the binary network very small
        # After conv reduction, we have 64 features (from 64 channels * 1x1)
        self.binary_layer = BinaryLinear(64, hidden_size)
        self.bn = nn.BatchNorm1d(hidden_size)
        self.dropout = nn.Dropout(dropout_rate)
        
        # Output classifier
        self.output_layer = nn.Linear(hidden_size, num_classes)
        
        # Print memory-optimized architecture details
        conv_params = sum(p.numel() for p in self.conv_reducer.parameters())
        binary_params = sum(p.numel() for p in self.binary_layer.parameters())
        output_params = sum(p.numel() for p in self.output_layer.parameters())
        total_params = conv_params + binary_params + output_params
        
        print(f"Memory-optimized 224x224 BNN created:")
        print(f"- Total parameters: {total_params:,} (vs. {3*224*224*hidden_size:,} for naive approach)")
        print(f"- Parameter reduction: {100 - (total_params / (3*224*224*hidden_size)) * 100:.2f}%")
        print(f"- Using gradient checkpointing: {USE_GRADIENT_CHECKPOINTING if 'USE_GRADIENT_CHECKPOINTING' in globals() else False}")
        
    def forward(self, x):
        # If input is already flattened, reshape it to image format
        if len(x.shape) == 2:
            x = x.view(-1, 3, 224, 224)
            
        # Process through convolutional feature extractor
        # Use checkpointing if enabled to save memory during training
        if self.training and globals().get('USE_GRADIENT_CHECKPOINTING', False):
            from torch.utils.checkpoint import checkpoint
            # Split the conv_reducer into chunks for checkpointing
            modules = list(self.conv_reducer.children())
            # Process in chunks of 4 modules (one conv block per chunk)
            for i in range(0, len(modules), 4):
                chunk = nn.Sequential(*modules[i:i+4])
                # Pass the chunk itself: a lambda would capture the loop variable
                # by reference and recompute with the wrong chunk during backward
                x = checkpoint(chunk, x, use_reentrant=False)
        else:
            x = self.conv_reducer(x)
        
        # Flatten after convolutions
        x = x.view(x.size(0), -1)
        
        # Binary layer processing
        x = self.binary_layer(x)
        x = binary_activation(x)
        x = self.bn(x)
        x = self.dropout(x)
        
        # Output layer
        return self.output_layer(x)

# Complete Memory Optimization Guide

If you're still facing memory issues even with reduced batch size and 40 images per class, try these steps in order:

## Step 1: Force CPU Mode
If your GPU simply doesn't have enough memory, you can run the entire model on CPU:
```python
device = torch.device("cpu")
```

## Step 2: Use Ultra-Small Images
Reduce image size to 64x64 or even 32x32 pixels:
```python
memory_config['image_size'] = 32
```

## Step 3: Use the Checkpointed Model
Use the memory-optimized model with checkpointing:
```python
# Use the checkpointed model instead of the original
model = BinaryNeuralNetworkCheckpointed(
    input_size=3 * image_size * image_size,
    hidden_size=memory_config['hidden_size'],
    num_classes=len(class_names),
    dropout_rate=0.3
)
```

## Step 4: Drastically Limit Dataset Size
Reduce to even fewer images per class:
```python
memory_config['max_samples_per_class'] = 10  # Use only 10 images per class
```

## Step 5: Disable Mixed Precision
Mixed precision normally lowers memory use, so disable it only if it causes problems such as NaN losses, or when running on CPU:
```python
memory_config['use_mixed_precision'] = False
```
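
For reference, when mixed precision stays enabled, a training step typically wraps the forward pass in autocast and scales the loss before backward. The sketch below is a generic pattern, not the notebook's training loop; it uses a stand-in linear model with synthetic data, and it assumes mixed precision should only be switched on when a CUDA device is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # assumption: enable AMP only on a CUDA device

model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(8, 16, device=device)          # synthetic batch
targets = torch.randint(0, 4, (8,), device=device)  # synthetic labels

optimizer.zero_grad()
with torch.autocast(device_type=device.type, enabled=use_amp):
    loss = criterion(model(inputs), targets)
scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```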

## Step 6: Advanced Options
For extreme cases:
1. Disable gradient computation for most layers
2. Use 16-bit floats (half precision) throughout
3. Try training only the final layer and freezing the rest

## Step 7: External Processing
If all else fails:
1. Pre-process and extract features on CPU
2. Save features to disk
3. Load and train only on the extracted features
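
The save-and-reload part of this step can be sketched with `torch.save` and `torch.load`. The feature tensors below are synthetic stand-ins for real extracted features, and the file name is hypothetical.

```python
import os
import tempfile
import torch
from torch.utils.data import TensorDataset

# Synthetic stand-ins for extracted features and their labels
features = torch.randn(20, 1280)
labels = torch.randint(0, 4, (20,))

path = os.path.join(tempfile.mkdtemp(), "train_features.pt")  # hypothetical file name
torch.save({"features": features, "labels": labels}, path)

# Later, possibly in a fresh session: load and wrap for training
saved = torch.load(path)
feature_dataset = TensorDataset(saved["features"], saved["labels"])
print(len(feature_dataset), feature_dataset[0][0].shape)
```

Because only small feature vectors are loaded, the training session never needs to hold the images or the feature extractor in memory.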

In [16]:
# Memory-Efficient Feature Preprocessing for 224x224 Images
# This cell provides functions to preprocess images and extract features before training

def create_feature_extractor_for_224():
    """Create a lightweight feature extractor for 224x224 images"""
    import torchvision.models as models
    from torch.utils.checkpoint import checkpoint_sequential
    
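    # Use a pretrained model with the head removed
    # MobileNetV2 is considerably lighter than larger backbones such as VGG or ResNet-50
    # (torchvision >= 0.13 uses the weights argument; older releases use pretrained=True)
    model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)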
    
    # Remove the classifier to get features only
    feature_extractor = torch.nn.Sequential(*(list(model.features)))
    
    # Freeze the feature extractor to save memory during training
    for param in feature_extractor.parameters():
        param.requires_grad = False
        
    # Move to appropriate device
    feature_extractor = feature_extractor.to(device)
    feature_extractor.eval()
    
    return feature_extractor

def extract_features_from_dataset(dataset, feature_extractor, batch_size=4):
    """Extract features from a dataset using a pre-trained model to reduce training memory"""
    from torch.utils.data import DataLoader, TensorDataset
    import time
    
    # Create a dataloader for the dataset
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    
    all_features = []
    all_labels = []
    
    # Process in batches
    print(f"Extracting features from {len(dataset)} images...")
    start_time = time.time()
    
    with torch.no_grad():
        for batch_idx, (images, labels) in enumerate(loader):
            if batch_idx % 10 == 0:
                print(f"Processing batch {batch_idx+1}/{len(loader)}...")
            
            # Move to device
            images = images.to(device)
            
            # Extract features
            features = feature_extractor(images)
            
            # Global average pooling to get a fixed-size feature vector
            features = torch.nn.functional.adaptive_avg_pool2d(features, (1, 1))
            features = features.view(features.size(0), -1)
            
            # Store features and labels
            all_features.append(features.cpu())
            all_labels.append(labels)
            
            # Clean up to save memory
            del images, features
            torch.cuda.empty_cache()
    
    # Concatenate all features and labels
    all_features = torch.cat(all_features, dim=0)
    all_labels = torch.cat(all_labels, dim=0)
    
    # Create a new TensorDataset with the extracted features
    feature_dataset = TensorDataset(all_features, all_labels)
    
    print(f"Feature extraction completed in {time.time() - start_time:.2f} seconds")
    print(f"Extracted feature shape: {all_features.shape}")
    
    return feature_dataset

# Example usage (uncomment to run):
# feature_extractor = create_feature_extractor_for_224()
# train_feature_dataset = extract_features_from_dataset(train_dataset, feature_extractor)
# test_feature_dataset = extract_features_from_dataset(test_dataset, feature_extractor)

# Then create new data loaders with the feature datasets:
# train_feature_loader = DataLoader(train_feature_dataset, batch_size=memory_config['batch_size'], shuffle=True)
# test_feature_loader = DataLoader(test_feature_dataset, batch_size=memory_config['batch_size'], shuffle=False)

# Complete Guide to Training with 224x224 Images

This guide provides multiple strategies to train your model while maintaining 224x224 image size.

## Strategy 1: Use Convolutional Architecture
The `BNN224x224MemoryOptimized` class uses convolutional layers instead of flattening the entire image, which drastically reduces memory usage while maintaining the 224x224 input size.

```python
# Create the memory-optimized model for 224x224
model = BNN224x224MemoryOptimized(
    hidden_size=64,  # Smaller hidden size
    num_classes=len(class_names),
    dropout_rate=0.5  # Aggressive dropout
).to(device)

# Continue with training as normal using this model
```
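
The parameter saving can be checked by hand. The sketch below counts the weights of the five conv blocks in `conv_reducer` (3x3 convolutions with bias, each followed by a BatchNorm layer) and compares them with the weight matrix of a single fully connected layer applied to the flattened image; the helper names are hypothetical.

```python
def conv2d_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out  # kernel weights + bias

def batchnorm_params(channels):
    return 2 * channels  # learnable scale and shift

channels = [3, 8, 16, 32, 32, 64]  # channel progression of conv_reducer
conv_reducer_params = sum(
    conv2d_params(c_in, c_out) + batchnorm_params(c_out)
    for c_in, c_out in zip(channels, channels[1:])
)
hidden_size = 64
naive_first_layer = 3 * 224 * 224 * hidden_size  # flattened input -> hidden

print(f"conv_reducer parameters: {conv_reducer_params:,}")
print(f"naive first layer weights: {naive_first_layer:,}")
```

This count covers parameters only. Activation memory, which scales with feature-map size and batch size, is not captured here.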

## Strategy 2: Feature Pre-extraction
Extract features once using a pre-trained model, then train on these features:

```python
# 1. Create feature extractor
feature_extractor = create_feature_extractor_for_224()

# 2. Process train and test datasets
train_feature_dataset = extract_features_from_dataset(train_dataset, feature_extractor)
test_feature_dataset = extract_features_from_dataset(test_dataset, feature_extractor)

# 3. Create data loaders for the feature datasets
train_feature_loader = DataLoader(train_feature_dataset, batch_size=32, shuffle=True)
test_feature_loader = DataLoader(test_feature_dataset, batch_size=32, shuffle=False)

# 4. Create a simple model that works with extracted features
feature_model = nn.Sequential(
    nn.Linear(1280, 256),  # MobileNetV2 features have 1280 dimensions
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, len(class_names))
).to(device)

# 5. Train as normal using the feature_model and train_feature_loader
```
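
Step 5 above can be sketched as a plain training loop. The features here are random stand-ins for the extracted features, the class count of 4 matches the dataset loaded earlier, and the optimizer and learning rate are assumptions rather than the notebook's settings.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Synthetic stand-in for pre-extracted features: 64 samples, 1280 dims, 4 classes
features = torch.randn(64, 1280)
labels = torch.randint(0, 4, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

feature_model = nn.Sequential(
    nn.Linear(1280, 256), nn.ReLU(), nn.Dropout(0.5), nn.Linear(256, 4)
)
optimizer = torch.optim.Adam(feature_model.parameters(), lr=1e-3)  # assumed settings
criterion = nn.CrossEntropyLoss()

feature_model.train()
for epoch in range(3):
    running_loss = 0.0
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        loss = criterion(feature_model(batch_features), batch_labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * batch_features.size(0)
    print(f"epoch {epoch + 1}: loss {running_loss / len(features):.4f}")
```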

## Strategy 3: Train in CPU Mode
If GPU memory is still an issue, force CPU mode:

```python
device = torch.device("cpu")
model = BNN224x224MemoryOptimized(...).to(device)
```

## Strategy 4: Advanced PyTorch Memory Optimization

```python
# 1. Optionally use torch.compile (PyTorch 2.0+); this mainly targets speed,
#    and any memory savings depend on the model and backend
if hasattr(torch, 'compile'):
    model = torch.compile(model)

# 2. Disable gradient history for non-essential layers
def set_requires_grad(model, requires_grad=False):
    for param in model.parameters():
        param.requires_grad = requires_grad

# Keep only the last layer trainable
set_requires_grad(model, False)
set_requires_grad(model.output_layer, True)

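# 3. Use 16-bit precision throughout (inputs must also be converted with .half();
#    intended for GPU use, and training purely in float16 can be numerically unstable)
model.half()  # Convert model to half precision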
```

## Strategy 5: Activate NVIDIA Memory-Efficient Features
For NVIDIA GPUs, set these environment variables before PyTorch initializes CUDA (for example, at the top of the notebook):
```python
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use only the first GPU
# Allocator options are key:value pairs separated by commas
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32,garbage_collection_threshold:0.8'
torch.backends.cudnn.benchmark = False
```

In [None]:
# Ultra Memory-Efficient Configuration Example
# Run this cell if you're still experiencing memory issues

# 1. Force CPU mode if needed (uncomment to use)
# device = torch.device("cpu")
# print(f"Forced CPU mode: {device}")

# 2. Apply ultra-efficient memory config
ultra_config = {
    'batch_size': 1,  # Note: BatchNorm1d layers raise an error in training mode when a batch holds a single sample
    'gradient_accumulation_steps': 16,  # More gradient accumulation steps
    'gc_frequency': 1,
    'memory_efficient': True,
    'use_mixed_precision': False,  # Disable mixed precision if on CPU
    'image_size': 64,  # Drastically reduced image size
    'hidden_size': 64,
    'embedding_size': 128,
    'max_samples_per_class': 10,  # Ultra-limited for testing
    'num_workers': 0,
    'pin_memory': False,
}

# Uncomment to apply this config
# memory_config.update(ultra_config)
# print("Ultra-efficient memory config applied!")

# 3. Create checkpointed model example
def create_ultralight_model():
    """Create an ultra-memory efficient model"""
    input_size = 3 * memory_config.get('image_size', 64) * memory_config.get('image_size', 64)
    model = BinaryNeuralNetworkCheckpointed(
        input_size=input_size,
        hidden_size=memory_config.get('hidden_size', 64),
        num_classes=len(class_names) if 'class_names' in globals() else 3,
        dropout_rate=0.5  # More dropout for regularization
    )
    return model.to(device)

# Example of creating the model (uncomment to use)
# model = create_ultralight_model()
# print(f"Ultra-light model created with {sum(p.numel() for p in model.parameters())} parameters")


In [None]:
# Instantiate the BNN Model
# Calculate the input size based on the image size
input_size = 3 * image_size * image_size  # For RGB images (3 channels)
hidden_size = memory_config['hidden_size']
embedding_size = memory_config['embedding_size']
num_hidden_layers = memory_config.get('num_hidden_layers', 1)

# Get the number of classes from the dataset
num_classes = len(class_names)

# Create an instance of the BNN model
model = BinaryNeuralNetwork(
    input_size=input_size,
    hidden_size=hidden_size,
    num_classes=num_classes,
    num_hidden_layers=num_hidden_layers,
    embedding_size=embedding_size,
    dropout_rate=0.2
)

# Move model to device (GPU if available)
model = model.to(device)

# Print model architecture summary
print(f"Binary Neural Network Architecture:")
print(f"- Input size: {input_size} (3 channels × {image_size} × {image_size})")
print(f"- Hidden size: {hidden_size}")
print(f"- Embedding size: {embedding_size}")
print(f"- Hidden layers: {num_hidden_layers}")
print(f"- Output classes: {num_classes} ({class_names})")
print(f"- Device: {device}")

# Count the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Print the full model architecture for reference
print("\nDetailed model architecture:")
print(model)

Progressive dimension reduction: 150528 → 16384 → 4096 → 2048 → 1024 → 512
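
To see where most of the weights would sit, the reduction chain printed above can be turned into a rough count. This is a back-of-the-envelope sketch that assumes each arrow is a dense layer without bias stored in float32; the actual `BinaryNeuralNetwork` layers may be built differently.

```python
# Layer widths taken from the printed reduction chain
widths = [150528, 16384, 4096, 2048, 1024, 512]

# Weights per layer, assuming dense (fully connected) layers without bias
layer_weights = [n_in * n_out for n_in, n_out in zip(widths, widths[1:])]
total_weights = sum(layer_weights)

BYTES_PER_FLOAT32 = 4
for (n_in, n_out), count in zip(zip(widths, widths[1:]), layer_weights):
    gib = count * BYTES_PER_FLOAT32 / 1024**3
    print(f"{n_in:>6} -> {n_out:>5}: {count:>13,} weights ({gib:.3f} GiB in float32)")

first_layer_share = layer_weights[0] / total_weights
print(f"First layer share of all weights: {first_layer_share:.1%}")
```

Under these assumptions nearly all weights belong to the first projection, so shrinking the input dimension or replacing that projection with convolutions would have the largest effect.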


In [None]:
# Configure Training Parameters
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Loss function - CrossEntropyLoss for multi-class classification
criterion = nn.CrossEntropyLoss()

# Optimizer - Adam with a relatively low learning rate for stability
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Learning rate scheduler - reduce on plateau
scheduler = ReduceLROnPlateau(
    optimizer, 
    mode='min',  # Minimize loss
    factor=0.5,  # Reduce LR by half when triggered
    patience=3,  # Wait 3 epochs for improvement
    verbose=True,
    min_lr=0.00001  # Don't go below this learning rate
)

# Training parameters
num_epochs = 20  # SET THE NUMBER OF EPOCHS HERE
early_stopping_patience = 5  # Stop training if no improvement for this many epochs

# Gradient accumulation steps from memory config
gradient_accumulation_steps = memory_config.get('gradient_accumulation_steps', 1)

# Print training configuration
print(f"Training Configuration:")
print(f"- Number of epochs: {num_epochs}")
print(f"- Learning rate: {learning_rate}")
print(f"- Loss function: CrossEntropyLoss")
print(f"- Optimizer: Adam")
print(f"- Scheduler: ReduceLROnPlateau (factor=0.5, patience=3)")
print(f"- Early stopping patience: {early_stopping_patience}")
print(f"- Gradient accumulation steps: {gradient_accumulation_steps}")
print(f"- Effective batch size: {batch_size * gradient_accumulation_steps}")
print(f"- Mixed precision: {use_mixed_precision}")
print(f"- Memory efficient: {memory_efficient}")
print(f"- GC frequency: {gc_frequency}")
print(f"\nTraining will begin with {num_epochs} epochs...")
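
The effective batch size printed above comes from gradient accumulation: several small batches contribute gradients before a single optimizer step. The following is a minimal sketch of that idea on a toy linear model. The function `accumulate_and_step` and the toy data are illustrative assumptions and are not the notebook's `train_memory_efficient`.

```python
import torch
import torch.nn as nn

def accumulate_and_step(model, optimizer, criterion, batches, accumulation_steps):
    """Run one optimizer step per `accumulation_steps` micro-batches."""
    model.train()
    optimizer.zero_grad()
    steps_taken = 0
    for i, (inputs, targets) in enumerate(batches, start=1):
        loss = criterion(model(inputs), targets)
        # Scale so the summed gradients match the gradient of the mean loss
        (loss / accumulation_steps).backward()
        if i % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            steps_taken += 1
    return steps_taken

torch.manual_seed(0)
toy_model = nn.Linear(8, 3)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
# 32 micro-batches of size 1 (random toy data)
toy_batches = [(torch.randn(1, 8), torch.randint(0, 3, (1,))) for _ in range(32)]

steps = accumulate_and_step(toy_model, toy_optimizer, nn.CrossEntropyLoss(),
                            toy_batches, accumulation_steps=16)
print(f"Optimizer steps: {steps}")
```

Only one micro-batch of activations is alive at a time, while the update behaves like a batch of 16. In this sketch, micro-batches left over after the last full window are not applied.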

In [None]:
# Train the Model
import time
import matplotlib.pyplot as plt

# Record the start time for overall training
start_time = time.time()

print("Starting model training...")
print(f"Training on {len(train_dataset)} samples, validating on {len(test_dataset)} samples")

# Use the memory-efficient training function defined earlier
training_results = train_memory_efficient(
    model=model,
    train_loader=train_loader,
    criterion=criterion,
    optimizer=optimizer,
    num_epochs=num_epochs,
    device=device,
    scheduler=scheduler,
    gradient_accumulation_steps=gradient_accumulation_steps,
    memory_efficient=memory_efficient,
    gc_frequency=gc_frequency,
    use_mixed_precision=use_mixed_precision,
    early_stopping_patience=early_stopping_patience
)

# Calculate total training time
total_time = time.time() - start_time
hours, remainder = divmod(total_time, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"\nTraining completed in {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")
print(f"Final loss: {training_results['final_loss']:.4f}")
print(f"Final accuracy: {training_results['final_accuracy']:.2f}%")

# Plot training metrics
plt.figure(figsize=(14, 5))

# Plot loss
plt.subplot(1, 2, 1)
plt.plot(training_results['train_losses'], 'b-', label='Training Loss')
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(training_results['train_accuracies'], 'g-', label='Training Accuracy')
plt.title('Training Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# Save the training history to CSV
import os
import pandas as pd
os.makedirs('results', exist_ok=True)  # Make sure the output directory exists
history_df = pd.DataFrame(training_results['history'])
history_df.to_csv('results/bnn_224x224_training_history.csv', index=False)
print("Training history saved to results/bnn_224x224_training_history.csv")

In [None]:
# Evaluate the Trained Model
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

print("Evaluating model on test dataset...")

# Use the memory-efficient evaluation function
eval_results = evaluate_bnn_memory_efficient(
    model=model,
    test_loader=test_loader,
    criterion=criterion,
    device=device,
    class_names=class_names,
    use_mixed_precision=use_mixed_precision
)

# Print evaluation metrics
print(f"\nEvaluation Results:")
print(f"Test Loss: {eval_results['loss']:.4f}")
print(f"Test Accuracy: {eval_results['accuracy']:.2f}%")
print(f"Macro Precision: {eval_results['macro_precision']:.4f}")
print(f"Macro Recall: {eval_results['macro_recall']:.4f}")
print(f"Macro F1 Score: {eval_results['macro_f1']:.4f}")
print(f"\nPer-Class Accuracy:")
for class_name, acc in eval_results['class_accuracy'].items():
    print(f"  {class_name}: {acc:.2f}%")

# Plot confusion matrix
plt.figure(figsize=(10, 8))
cm = eval_results['confusion_matrix']
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

# Save evaluation results
import json
import numpy as np

# Convert numpy arrays to lists for JSON serialization
serializable_results = {k: (v.tolist() if isinstance(v, np.ndarray) else v) 
                        for k, v in eval_results.items()
                        if k not in ['samples', 'roc_data']}  # Exclude image data

with open('results/bnn_224x224_evaluation_results.json', 'w') as f:
    json.dump(serializable_results, f, indent=2)
    
print("Evaluation results saved to results/bnn_224x224_evaluation_results.json")

# Save the trained model
model_save_path = 'results/bnn_224x224_model.pth'
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'class_names': class_names,
    'config': {
        'input_size': input_size,
        'hidden_size': hidden_size,
        'embedding_size': embedding_size,
        'num_hidden_layers': num_hidden_layers,
        'num_classes': num_classes,
        'image_size': image_size
    }
}, model_save_path)

print(f"Trained model saved to {model_save_path}")
print("\nTraining and evaluation complete!")

# Advanced Memory Optimization Guide for 224x224 Images

This guide collects more aggressive memory optimization techniques that keep the full 224x224 image resolution for high-detail plant disease classification.

## Technique 1: Patch-Based Processing

Process images in smaller patches rather than whole images:

```python
def patch_based_inference(model, image, patch_size=112, stride=56):
    """Extract features from a 224x224 image batch patch by patch to lower peak memory.

    Expects `image` with shape (N, C, H, W) and a model exposing `conv_reducer`.
    """
    # 1. Split the image into overlapping patches
    # 2. Run the feature extractor on each patch separately
    # 3. Concatenate the per-patch features along dimension 1
    #    (overlapping regions are not averaged)
    h, w = image.shape[2], image.shape[3]

    # Extract patches (slices are views, so pixel data is not copied)
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[:, :, y:y+patch_size, x:x+patch_size])

    # Process each patch without building an autograd graph
    patch_features = []
    with torch.no_grad():
        for patch in patches:
            patch_features.append(model.conv_reducer(patch))  # Feature extractor only

    # Combine patch features
    return torch.cat(patch_features, dim=1)
```
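
Patch-based processing trades memory for compute, and the size of that trade follows from the patch grid. The sketch below counts the patches for the default settings (`patch_size=112`, `stride=56`) on a 224x224 image; `patch_origins` is a hypothetical helper name used only for illustration.

```python
def patch_origins(image_size=224, patch_size=112, stride=56):
    """Return the (y, x) top-left corner of every patch in a square image."""
    positions = range(0, image_size - patch_size + 1, stride)
    return [(y, x) for y in positions for x in positions]

origins = patch_origins()
pixels_per_patch = 112 * 112
overhead = len(origins) * pixels_per_patch / (224 * 224)

print(f"Patches per image: {len(origins)}")
print(f"Pixels processed relative to one full pass: {overhead:.2f}x")
```

Each patch holds a quarter of the image's pixels, so peak activation memory per forward pass drops, but the overlapping grid processes more pixels in total than a single full-image pass.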

## Technique 2: Advanced PyTorch Memory Management

```python
import os

# Use custom CUDA allocator settings (key:value syntax); set before the first CUDA call
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32,garbage_collection_threshold:0.6'

# Set stricter memory limits to avoid OOM
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.7)  # Cap this process at 70% of GPU memory
    torch.cuda.empty_cache()

# Anomaly detection gives better error messages but slows training and uses
# extra memory, so enable it only while debugging
torch.autograd.set_detect_anomaly(True)
```

## Technique 3: Ultra-Efficient Data Processing

1. Use on-the-fly image loading with caching
2. Implement lazy dataset that loads images only when needed
3. Compress images during processing

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class MemoryEfficientImageDataset(Dataset):
    def __init__(self, root_dir, transform=None, cache_size=100):
        self.root_dir = root_dir
        self.transform = transform
        self.samples = []
        self.cache = {}
        self.cache_size = cache_size
        
        # Build index of samples without loading images; sort the class folders
        # so that class indices are contiguous and stable across runs
        class_names = sorted(d for d in os.listdir(root_dir)
                             if os.path.isdir(os.path.join(root_dir, d)))
        for class_idx, class_name in enumerate(class_names):
            class_dir = os.path.join(root_dir, class_name)
            for img_name in sorted(os.listdir(class_dir)):
                img_path = os.path.join(class_dir, img_name)
                if os.path.isfile(img_path):
                    self.samples.append((img_path, class_idx))
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        img_path, label = self.samples[idx]
        
        # Check cache first
        if img_path in self.cache:
            img = self.cache[img_path]
        else:
            # Load image
            img = Image.open(img_path).convert('RGB')
            
            # Cache management - limit size
            if len(self.cache) >= self.cache_size:
                # Evict the oldest entry (dicts preserve insertion order, so this is FIFO)
                key_to_remove = next(iter(self.cache))
                del self.cache[key_to_remove]
            
            # Add to cache
            self.cache[img_path] = img
        
        # Apply transformations
        if self.transform:
            img = self.transform(img)
        
        return img, label
```
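
The cache in `MemoryEfficientImageDataset` evicts whichever entry was inserted first. If recently accessed images should be kept instead, a least-recently-used policy is one alternative. The sketch below is an illustrative stand-alone helper, `LRUCache`, built on `collections.OrderedDict`; it is not part of the notebook, and the strings stand in for decoded images.

```python
from collections import OrderedDict

class LRUCache:
    """Small least-recently-used cache (illustrative helper)."""

    def __init__(self, max_items=100):
        self.max_items = max_items
        self._items = OrderedDict()

    def get(self, key):
        """Return the cached value or None, marking the key as recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert a value and evict least recently used entries if over capacity."""
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)

    def __len__(self):
        return len(self._items)

cache = LRUCache(max_items=2)
cache.put("a.jpg", "image A")
cache.put("b.jpg", "image B")
cache.get("a.jpg")             # "a.jpg" becomes the most recently used entry
cache.put("c.jpg", "image C")  # evicts "b.jpg", the least recently used entry
print(len(cache))
```

With shuffled epochs, any small cache is hit rarely, so the benefit depends on the access pattern and on how `cache_size` compares with the dataset size.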

## Technique 4: Layer-by-Layer Training

Train the model in stages to reduce peak memory usage:

1. Train only the convolutional features first
2. Freeze the features and train the binary network
3. Fine-tune the entire model with reduced learning rate
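
The three stages can be sketched by toggling `requires_grad` and rebuilding the optimizer over the currently trainable parameters. The example below uses a toy stand-in model (`TinyHybrid`) and hypothetical helpers (`set_trainable`, `count_trainable`); the learning rates and the single step per stage on random data are placeholders, not tuned values.

```python
import torch
import torch.nn as nn

class TinyHybrid(nn.Module):
    """Toy stand-in with a convolutional part and a classifier head (hypothetical)."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.head(self.features(x))

def set_trainable(module, trainable):
    """Enable or disable gradients for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

torch.manual_seed(0)
staged_model = TinyHybrid()
criterion = nn.CrossEntropyLoss()
inputs, targets = torch.randn(2, 3, 32, 32), torch.tensor([0, 2])

# (description, train features?, train head?, learning rate)
schedule = [
    ("features only", True, False, 1e-3),
    ("head only", False, True, 1e-3),
    ("fine-tune all", True, True, 1e-4),  # reduced learning rate
]

stage_sizes = []
for description, train_features, train_head, lr in schedule:
    set_trainable(staged_model.features, train_features)
    set_trainable(staged_model.head, train_head)
    # Optimizer state is allocated only for the trainable parameters
    trainable = [p for p in staged_model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)

    optimizer.zero_grad()
    criterion(staged_model(inputs), targets).backward()  # one placeholder step
    optimizer.step()

    stage_sizes.append(count_trainable(staged_model))
    print(f"{description}: {stage_sizes[-1]} trainable parameters")
```

Frozen parameters need neither gradient buffers nor optimizer state, which is where most of the saving in the first two stages would come from.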

## Technique 5: Use TorchScript or ONNX for Inference

Convert the model to a more memory-efficient format for inference:

```python
# TorchScript conversion (switch to eval mode first so dropout and batch norm use inference behavior)
model.eval()
scripted_model = torch.jit.script(model)

# Or ONNX export
dummy_input = torch.randn(1, 3, 224, 224, device=device)
torch.onnx.export(model, dummy_input, "bnn_224x224.onnx")
```
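
If scripting does not work for a given model, tracing with an example input is another route to TorchScript. The sketch below traces a small stand-in network (a hypothetical architecture, not the notebook's model) and checks that the traced module reproduces the eager output.

```python
import torch
import torch.nn as nn

# Toy model standing in for the trained network (hypothetical architecture)
export_model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 3),
).eval()

example_input = torch.randn(1, 3, 224, 224)

# Tracing records the operations executed for this example input
traced_model = torch.jit.trace(export_model, example_input)

with torch.no_grad():
    reference = export_model(example_input)
    traced_output = traced_model(example_input)

outputs_match = torch.allclose(reference, traced_output, atol=1e-6)
print(f"Traced output matches eager output: {outputs_match}")
```

Tracing only captures the code path taken for the example input, so models with data-dependent control flow need scripting or a manual check on several inputs.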

## Technique 6: Memory Profiling and Optimization

Use memory profiling to identify bottlenecks:

```python
# Memory profiling example
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True, record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))
```

The results will help you identify exactly which operations are consuming the most memory so you can optimize them specifically.

In [None]:
# Ultimate Memory-Optimized CNN-BNN Hybrid Model for 224x224 Images

class ConvBNN224x224(nn.Module):
    """
    Ultra memory-efficient CNN-BNN hybrid model specifically for 224x224 images.
    Uses a combination of techniques to minimize memory usage while preserving full resolution.
    
    Key memory optimization techniques:
    1. Depth-wise separable convolutions to reduce parameters
    2. Progressive feature reduction
    3. Gradient checkpointing on all heavy layers
    4. Binary activations in later stages
    5. Minimal fully connected layers
    """
    def __init__(self, num_classes=3, dropout_rate=0.5):
        super(ConvBNN224x224, self).__init__()
        
        # Check whether gradient checkpointing is available
        if hasattr(torch, 'utils') and hasattr(torch.utils, 'checkpoint'):
            self.use_checkpointing = True
            print("Gradient checkpointing enabled for memory efficiency")
        else:
            self.use_checkpointing = False
            print("Gradient checkpointing not available in this PyTorch version")
        
        # Memory-efficient convolutional architecture
        self.features = nn.Sequential(
            # Stage 1: Initial feature extraction (224x224 -> 112x112)
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            
            # Stage 2: Depthwise separable block 1 (112x112 -> 56x56)
            nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1, groups=16, bias=False),  # Depthwise
            nn.Conv2d(16, 32, kernel_size=1, bias=False),  # Pointwise
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            
            # Stage 3: Depthwise separable block 2 (56x56 -> 28x28)
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, groups=32, bias=False),  # Depthwise
            nn.Conv2d(32, 64, kernel_size=1, bias=False),  # Pointwise
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            
            # Stage 4: Depthwise separable block 3 (28x28 -> 14x14)
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, groups=64, bias=False),  # Depthwise
            nn.Conv2d(64, 128, kernel_size=1, bias=False),  # Pointwise
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            
            # Stage 5: Final extraction (14x14 -> 7x7)
            nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, groups=128, bias=False),  # Depthwise
            nn.Conv2d(128, 256, kernel_size=1, bias=False),  # Pointwise
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            
            # Stage 6: Adaptive pooling to fixed size
            nn.AdaptiveAvgPool2d((1, 1))
        )
        
        # Binary classification stage
        self.binary_classifier = nn.Sequential(
            # Binary layer
            BinaryLinear(256, 128),
            nn.BatchNorm1d(128),
            nn.Dropout(dropout_rate)
        )
        
        # Final classifier
        self.classifier = nn.Linear(128, num_classes)
        
        # Initialize weights properly
        self._initialize_weights()
        
    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
    
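    def _run_stage(self, x, stage):
        """Helper method to run a stage with gradient checkpointing if enabled"""
        if self.use_checkpointing and self.training:
            # The non-reentrant variant still yields parameter gradients when the
            # input image itself does not require grad (as in the first stage)
            return torch.utils.checkpoint.checkpoint(stage, x, use_reentrant=False)
        return stage(x)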
    
    def forward(self, x):
        # Ensure input is correct shape
        if len(x.shape) == 2:
            # If flattened, reshape to image format
            x = x.view(-1, 3, 224, 224)
        
        # Process through feature extractor with checkpointing
        # Split the features into chunks for more efficient checkpointing
        modules = list(self.features.children())
        stages = [
            nn.Sequential(*modules[0:3]),   # Stage 1 
            nn.Sequential(*modules[3:8]),   # Stage 2
            nn.Sequential(*modules[8:13]),  # Stage 3
            nn.Sequential(*modules[13:18]), # Stage 4
            nn.Sequential(*modules[18:23]), # Stage 5
            modules[23]                     # Stage 6 (pooling)
        ]
        
        # Process through each stage with memory-efficient checkpointing
        for stage in stages:
            x = self._run_stage(x, stage)
        
        # Flatten
        x = x.view(x.size(0), -1)
        
        # Binary classification stage
        x = self.binary_classifier(x)
        x = binary_activation(x)  # Apply binary activation
        
        # Final classification
        return self.classifier(x)
        
    def predict_proba(self, x):
        """Get class probabilities using softmax"""
        with torch.no_grad():
            logits = self.forward(x)
            probabilities = F.softmax(logits, dim=1)
        return probabilities
    
    def predict(self, x):
        """Get predicted class labels"""
        with torch.no_grad():
            logits = self.forward(x)
            predictions = torch.argmax(logits, dim=1)
        return predictions

# Create model with hyperparameter control for even more memory efficiency
def create_memory_optimized_convbnn(num_classes, dropout_rate=0.5):
    """Factory function to create a memory-optimized ConvBNN model"""
    model = ConvBNN224x224(
        num_classes=num_classes,
        dropout_rate=dropout_rate
    )
    
    # Print model summary
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"ConvBNN224x224 Model Summary:")
    print(f"- Total parameters: {total_params:,}")
    print(f"- Trainable parameters: {trainable_params:,}")
    print(f"- Memory-optimized for 224x224 images")
    
    return model

# Example usage:
# conv_bnn_model = create_memory_optimized_convbnn(num_classes=len(class_names)).to(device)
# optimizer = optim.Adam(conv_bnn_model.parameters(), lr=0.001)
# Use the same training procedure as before with this model
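
With checkpointing active, roughly only the tensors at stage boundaries have to stay in memory between the forward and backward pass, while tensors inside a stage are recomputed. The arithmetic below estimates those boundary sizes per image from the stage comments in `ConvBNN224x224`. It is a rough sketch that assumes float32 activations and ignores framework overhead.

```python
# (stage name, channels, height, width) of each stage output for a 224x224 input
stage_outputs = [
    ("stage 1", 16, 112, 112),
    ("stage 2", 32, 56, 56),
    ("stage 3", 64, 28, 28),
    ("stage 4", 128, 14, 14),
    ("stage 5", 256, 7, 7),
]

BYTES_PER_FLOAT32 = 4
boundary_elements = {name: c * h * w for name, c, h, w in stage_outputs}

for name, count in boundary_elements.items():
    kib = count * BYTES_PER_FLOAT32 / 1024
    print(f"{name}: {count:>7,} values ({kib:.0f} KiB in float32)")

# Largest tensor inside stage 2: pointwise output before pooling (32 x 112 x 112)
largest_intermediate = 32 * 112 * 112
print(f"Largest pre-pooling tensor: {largest_intermediate:,} values")
```

The pre-pooling tensor in stage 2 is four times larger than that stage's output, which illustrates the kind of intermediate that checkpointing avoids storing.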

In [None]:
# Complete 224x224 Training Recipe with Extreme Memory Optimization
# This cell provides a complete recipe for training with 224x224 images

def train_with_224x224(num_samples_per_class=40):
    """
    Complete recipe function to train a model with 224x224 images
    with extreme memory optimization techniques.
    """
    print("Starting 224x224 training recipe with extreme memory optimization...")
    
    # 1. Prepare memory settings
    memory_config_224 = {
        'image_size': 224,  # Keep 224x224 resolution
        'batch_size': 1,    # Minimum batch size
        'gradient_accumulation_steps': 32,  # Aggressive accumulation
        'gc_frequency': 1,  # Maximum garbage collection
        'memory_efficient': True,
        'use_mixed_precision': True,
        'max_samples_per_class': num_samples_per_class,
        'hidden_size': 64,  # Minimal hidden units
    }
    
    print("\n1. Setting memory configuration for 224x224 images...")
    memory_config.update(memory_config_224)
    image_size = memory_config['image_size']  # Should be 224
    
    # 2. Apply aggressive CUDA memory optimizations
    print("\n2. Applying aggressive CUDA memory optimizations...")
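    # Note: the allocator reads this variable when CUDA is first initialized, so it
    # has no effect if tensors were already moved to the GPU earlier in the session
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:32'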
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.enabled = True  # Keep enabled for performance
    
    # Free up all memory before starting
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
    
    # 3. Define optimized data loading
    print("\n3. Creating memory-efficient data loading pipeline...")
    transform = transforms.Compose([
        transforms.Resize((224, 224)),  # Keep full resolution
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    
    # Load data with sample limit
    plant_dataset, class_names = load_plant_disease_dataset(
        dataset_path,
        image_size=224,
        memory_efficient=True,
        max_samples_per_class=memory_config['max_samples_per_class']
    )
    
    # Create train/test split
    total_size = len(plant_dataset)
    test_size = int(0.2 * total_size)
    train_size = total_size - test_size
    train_dataset, test_dataset = random_split(plant_dataset, [train_size, test_size])
    
    # Create data loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=memory_config['batch_size'],
        shuffle=True,
        num_workers=0,
        pin_memory=False
    )
    
    test_loader = DataLoader(
        test_dataset,
        batch_size=memory_config['batch_size'],
        shuffle=False,
        num_workers=0,
        pin_memory=False
    )
    
    print(f"  - Training samples: {len(train_dataset)}")
    print(f"  - Testing samples: {len(test_dataset)}")
    
    # 4. Create optimized model
    print("\n4. Creating memory-optimized ConvBNN model for 224x224 images...")
    model = ConvBNN224x224(num_classes=len(class_names), dropout_rate=0.5).to(device)
    
    # 5. Configure training with memory optimization
    print("\n5. Configuring memory-optimized training...")
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)  # Add weight decay
    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)
    
    # Training parameters
    num_epochs = 15  # Reduced epochs for testing
    gradient_accumulation_steps = memory_config['gradient_accumulation_steps']
    
    print(f"  - Epochs: {num_epochs}")
    print(f"  - Gradient accumulation: {gradient_accumulation_steps} steps")
    print(f"  - Effective batch size: {memory_config['batch_size'] * gradient_accumulation_steps}")
    
    # 6. Train with memory monitoring
    print("\n6. Starting memory-optimized training...")
    if torch.cuda.is_available():
        print(f"Initial GPU memory: {torch.cuda.memory_allocated() / (1024**3):.3f} GB")
    
    # Train with memory-efficient function
    training_results = train_memory_efficient(
        model=model,
        train_loader=train_loader,
        criterion=criterion,
        optimizer=optimizer,
        num_epochs=num_epochs,
        device=device,
        scheduler=scheduler,
        gradient_accumulation_steps=gradient_accumulation_steps,
        memory_efficient=True,
        gc_frequency=memory_config['gc_frequency'],
        use_mixed_precision=memory_config['use_mixed_precision'],
        early_stopping_patience=5
    )
    
    # 7. Evaluate and save
    print("\n7. Evaluating model...")
    eval_results = evaluate_bnn_memory_efficient(
        model=model,
        test_loader=test_loader,
        criterion=criterion,
        device=device,
        class_names=class_names,
        use_mixed_precision=memory_config['use_mixed_precision']
    )
    
    print(f"\nTraining completed successfully!")
    print(f"  - Final loss: {training_results['final_loss']:.4f}")
    print(f"  - Final accuracy: {training_results['final_accuracy']:.2f}%")
    print(f"  - Test accuracy: {eval_results['accuracy']:.2f}%")
    
    # 8. Save model
    save_path = f"results/convbnn_224x224_samples{num_samples_per_class}_model.pth"
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'class_names': class_names,
    }, save_path)
    print(f"\nModel saved to {save_path}")
    
    return model, training_results, eval_results

# To run the complete recipe, uncomment the following line:
# model_224, train_results_224, eval_results_224 = train_with_224x224(num_samples_per_class=40)
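
With a batch size of 1 and 32 accumulation steps, a small sample limit leaves very few optimizer updates per epoch. The estimate below follows the recipe's 80/20 split; `optimizer_steps_per_epoch` is a hypothetical helper, and the class count of 3 is an assumption taken from the model's default `num_classes`, with every class assumed to supply the full sample limit.

```python
def optimizer_steps_per_epoch(samples_per_class, num_classes, batch_size,
                              accumulation_steps, test_fraction=0.2):
    """Estimate full accumulation windows per epoch for an 80/20 split."""
    total = samples_per_class * num_classes
    test_size = int(test_fraction * total)
    train_size = total - test_size
    micro_batches = -(-train_size // batch_size)  # ceiling division
    return train_size, micro_batches // accumulation_steps

train_size, steps = optimizer_steps_per_epoch(
    samples_per_class=40, num_classes=3, batch_size=1, accumulation_steps=32
)
print(f"Training images: {train_size}, optimizer steps per epoch: {steps}")
```

If the number of updates per epoch turns out this small, lowering `gradient_accumulation_steps` or raising `num_samples_per_class` may be worth testing.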

In [None]:
# Advanced Memory Optimization Techniques for 224x224 Images
# These techniques go beyond basic memory saving methods

import torch
import os
import gc
import numpy as np
from torch.cuda.amp import autocast

class AdvancedMemoryOptimizer:
    """Class providing cutting-edge memory optimization techniques for 224x224 images"""
    
    @staticmethod
    def configure_pytorch_memory(fraction=0.8, min_block_size=32):
        """Configure PyTorch memory management settings"""
        # Apply advanced CUDA memory settings
        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = f'expandable_segments:True,max_split_size_mb:{min_block_size}'
        
        if torch.cuda.is_available():
            # Limit memory fraction used
            torch.cuda.set_per_process_memory_fraction(fraction)
            
            # Empty cache
            torch.cuda.empty_cache()
            
            # Force garbage collection
            gc.collect()
            
            # Print current memory status
            allocated = torch.cuda.memory_allocated() / (1024 ** 3)
            reserved = torch.cuda.memory_reserved() / (1024 ** 3)
            print(f"Memory status after optimization:")
            print(f"- Allocated: {allocated:.3f} GB")
            print(f"- Reserved: {reserved:.3f} GB")
            
            return True
        else:
            print("CUDA not available - no memory optimization applied")
            return False

    @staticmethod
    def apply_tensor_compression(model):
        """Apply tensor compression techniques to model parameters"""
        print("Applying tensor compression...")
        
        # Count original parameters
        original_params = sum(p.numel() for p in model.parameters())
        
        # Round weights of non-binary layers to float16 precision. The values are
        # cast back to float32, so parameter storage does not shrink; only the
        # numerical precision is reduced
        for name, module in model.named_modules():
            if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
                # Skip binary layers
                if 'binary' not in name.lower():
                    module.weight.data = module.weight.data.half().float()
                    if module.bias is not None:
                        module.bias.data = module.bias.data.half().float()
        
        # Force single precision for BatchNorm to avoid training issues
        for m in model.modules():
            if isinstance(m, torch.nn.BatchNorm1d) or isinstance(m, torch.nn.BatchNorm2d):
                m.float()
        
        print(f"Tensor compression applied to non-binary layers")
        print(f"Original parameter count: {original_params:,}")
        return model
    
    @staticmethod
    def create_patch_dataset_loader(dataset, image_size=224, patch_size=112, batch_size=4):
        """Create a dataset that processes images as patches to save memory"""
        from torch.utils.data import Dataset, DataLoader
        
        class PatchDataset(Dataset):
            def __init__(self, original_dataset, patch_size=112, stride=56, transforms=None):
                self.dataset = original_dataset
                self.patch_size = patch_size
                self.stride = stride
                self.transforms = transforms
                self.patch_indices = []
                
                # For each image, pre-compute patch locations
                for img_idx in range(len(self.dataset)):
                    # Assuming each image is square with size image_size
                    for y in range(0, image_size - patch_size + 1, stride):
                        for x in range(0, image_size - patch_size + 1, stride):
                            self.patch_indices.append((img_idx, y, x))
                
                print(f"Created patch dataset with {len(self.patch_indices)} patches from {len(self.dataset)} images")
                print(f"- Patch size: {patch_size}x{patch_size}")
                print(f"- Stride: {stride}")
                
            def __len__(self):
                return len(self.patch_indices)
                
            def __getitem__(self, idx):
                img_idx, y, x = self.patch_indices[idx]
                img, label = self.dataset[img_idx]
                
                # Extract patch - handle both tensor and PIL image cases
                if isinstance(img, torch.Tensor):
                    if len(img.shape) == 3:  # C,H,W format
                        patch = img[:, y:y+self.patch_size, x:x+self.patch_size]
                    else:  # Assume it's already flattened
                        patch = img  # Can't extract patch from flattened image
                else:
                    # Assume PIL image
                    patch = img.crop((x, y, x+self.patch_size, y+self.patch_size))
                    if self.transforms:
                        patch = self.transforms(patch)
                
                return patch, label
        
        # Create patch dataset and loader
        patch_dataset = PatchDataset(dataset, patch_size=patch_size)
        patch_loader = DataLoader(
            patch_dataset, 
            batch_size=batch_size,
            shuffle=True,
            num_workers=0,
            pin_memory=False
        )
        
        return patch_loader
    
    @staticmethod
    def freeze_early_layers(model, freeze_fraction=0.7):
        """Freeze early layers of the model to reduce memory usage during training"""
        # Get all parameters
        params = list(model.named_parameters())
        
        # Calculate how many parameter tensors to freeze (counted per tensor, not per element)
        freeze_count = int(len(params) * freeze_fraction)
        
        # Freeze early layers
        frozen_params = 0
        for name, param in params[:freeze_count]:
            param.requires_grad = False
            frozen_params += param.numel()
        
        # Print status
        total_params = sum(p.numel() for p in model.parameters())
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        
        print(f"Layer freezing applied:")
        print(f"- Total parameters: {total_params:,}")
        print(f"- Frozen parameters: {frozen_params:,} ({frozen_params/total_params*100:.1f}%)")
        print(f"- Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.1f}%)")
        
        return model
    
    @staticmethod
    def optimize_training_loop(model, train_loader, criterion, optimizer, device, num_epochs=10):
        """Highly optimized training loop with extreme memory efficiency for 224x224 images"""
        # Import needed modules locally so this method is self-contained
        # (local imports do not affect memory usage)
        import torch
        import time
        import gc
        from torch.cuda.amp import autocast, GradScaler
        
        # Initialize scaler for mixed precision
        scaler = GradScaler()
        
        # Track metrics
        history = []
        
        # Training loop
        for epoch in range(num_epochs):
            epoch_start = time.time()
            running_loss = 0.0
            correct = 0
            total = 0
            
            # Set to train mode
            model.train()
            
            # Reset gradients
            optimizer.zero_grad()
            
            # Process batches
            for batch_idx, (inputs, targets) in enumerate(train_loader):
                # Move to device
                inputs, targets = inputs.to(device), targets.to(device)
                
                # Forward with mixed precision
                with autocast():
                    outputs = model(inputs)
                    loss = criterion(outputs, targets)
                
                # Backward with gradient scaling
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
                
                # Update metrics
                running_loss += loss.item()
                _, predicted = outputs.max(1)
                total += targets.size(0)
                correct += predicted.eq(targets).sum().item()
                
                # Aggressive memory cleanup after each batch:
                # drop references, collect garbage, then release cached CUDA blocks
                del inputs, outputs, predicted
                gc.collect()
                torch.cuda.empty_cache()
                
                # Print progress
                if batch_idx % 10 == 0:
                    print(f"Epoch {epoch+1}/{num_epochs} | Batch {batch_idx+1}/{len(train_loader)} | "
                          f"Loss: {loss.item():.4f}")
            
            # Calculate epoch metrics
            epoch_loss = running_loss / len(train_loader)
            epoch_acc = 100.0 * correct / total
            epoch_time = time.time() - epoch_start
            
            # Record history
            history.append({
                'epoch': epoch + 1,
                'loss': epoch_loss,
                'accuracy': epoch_acc,
                'time': epoch_time
            })
            
            # Print epoch summary
            print(f"Epoch {epoch+1}/{num_epochs} | Loss: {epoch_loss:.4f} | "
                  f"Accuracy: {epoch_acc:.2f}% | Time: {epoch_time:.2f}s")
            
            # Clean up memory at epoch end
            gc.collect()
            torch.cuda.empty_cache()
        
        return history

# Example usage of advanced memory techniques:
# 1. Configure PyTorch memory
# memory_optimizer = AdvancedMemoryOptimizer()
# memory_optimizer.configure_pytorch_memory(fraction=0.7)

# 2. Create or load your model
# model = ConvBNN224x224(num_classes=len(class_names)).to(device)

# 3. Apply tensor compression
# model = memory_optimizer.apply_tensor_compression(model)

# 4. Freeze early layers
# model = memory_optimizer.freeze_early_layers(model, freeze_fraction=0.7)

# 5. Create patch-based dataset loader (optional)
# patch_loader = memory_optimizer.create_patch_dataset_loader(train_dataset, patch_size=112)

# 6. Use the optimized training loop
# (the torch optimizer is kept in its own variable so it is not confused with
#  the AdvancedMemoryOptimizer instance; build it after freezing layers)
# criterion = torch.nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)
# history = memory_optimizer.optimize_training_loop(model, train_loader, criterion, optimizer, device, num_epochs=10)

print("Advanced memory optimization techniques loaded and ready to use")
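
The patch loader above crops one `patch_size` window per sample. As a minimal sketch of how a 224×224 image could be tiled into non-overlapping 112×112 crop boxes, the hypothetical helper `patch_boxes` below (not part of `AdvancedMemoryOptimizer`) lists the boxes in the same `(left, upper, right, lower)` order that `img.crop` expects. The tiling scheme itself is an assumption; the notebook's `PatchDataset` may choose `x, y` differently.

```python
def patch_boxes(image_size=224, patch_size=112):
    """Return PIL-style crop boxes (left, upper, right, lower) that tile the image."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size for a gap-free tiling")
    boxes = []
    for y in range(0, image_size, patch_size):
        for x in range(0, image_size, patch_size):
            boxes.append((x, y, x + patch_size, y + patch_size))
    return boxes


boxes = patch_boxes(224, 112)
print(f"{len(boxes)} patches per image")
for box in boxes:
    print(box)
```

Each box can be passed to `img.crop(box)`. Averaging the model's outputs over the patches of one image is one possible way to obtain an image-level prediction.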

In [None]:
# Feature Extraction Pipeline for 224x224 Images
# This cell provides a complete pipeline to extract features from 224x224 images
# and train a lightweight BNN on those features

class FeatureExtractionPipeline:
    """Complete pipeline for memory-efficient feature extraction and BNN training for 224x224 images"""
    
    def __init__(self, dataset_path, batch_size=4, num_workers=0, feature_dim=1280, feature_save_dir='features'):
        self.dataset_path = dataset_path
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.feature_dim = feature_dim
        self.feature_save_dir = feature_save_dir
        
        # Create directory for saving features
        os.makedirs(feature_save_dir, exist_ok=True)
        
        # Initialize state
        self.feature_extractor = None
        self.train_dataset = None
        self.test_dataset = None
        self.class_names = None
        self.train_features = None
        self.test_features = None
        self.train_labels = None
        self.test_labels = None
        
        print(f"Feature extraction pipeline initialized")
        print(f"Features will be saved to {os.path.abspath(feature_save_dir)}")
    
    def setup_data(self, max_samples_per_class=None, image_size=224):
        """Load and prepare the image dataset without loading all images into memory"""
        from torchvision import datasets, transforms
        
        # Define transforms
        transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
        
        # Load dataset without loading all images
        full_dataset = datasets.ImageFolder(root=self.dataset_path, transform=transform)
        self.class_names = full_dataset.classes
        
        # Limit samples per class if specified
        if max_samples_per_class is not None:
            class_indices = {}
            for idx, (_, class_idx) in enumerate(full_dataset.samples):
                if class_idx not in class_indices:
                    class_indices[class_idx] = []
                if len(class_indices[class_idx]) < max_samples_per_class:
                    class_indices[class_idx].append(idx)
            
            # Flatten indices list
            limited_indices = [idx for indices in class_indices.values() for idx in indices]
            from torch.utils.data import Subset
            full_dataset = Subset(full_dataset, limited_indices)
            print(f"Dataset limited to {max_samples_per_class} samples per class ({len(limited_indices)} total)")
        
        # Split into train/test
        from torch.utils.data import random_split
        test_size = int(0.2 * len(full_dataset))
        train_size = len(full_dataset) - test_size
        train_dataset, test_dataset = random_split(full_dataset, [train_size, test_size])
        
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        
        print(f"Dataset prepared with {len(train_dataset)} training and {len(test_dataset)} test samples")
        print(f"Classes: {self.class_names}")
        
        return train_dataset, test_dataset
    
    def create_feature_extractor(self, model_name='mobilenet_v2'):
        """Create a feature extractor for 224x224 images"""
        import torchvision.models as models
        
        print(f"Creating feature extractor using {model_name}...")
        
        if model_name == 'mobilenet_v2':
            # MobileNetV2 is very memory efficient
            model = models.mobilenet_v2(pretrained=True)
            # Remove classifier to get features
            feature_extractor = torch.nn.Sequential(*list(model.features))
            self.feature_dim = 1280  # MobileNetV2 produces 1280-dim features
            
        elif model_name == 'efficientnet_b0':
            # EfficientNet is also memory efficient
            model = models.efficientnet_b0(pretrained=True)
            # Remove classifier to get features
            feature_extractor = torch.nn.Sequential(*list(model.features))
            self.feature_dim = 1280  # EfficientNet-B0 produces 1280-dim features
            
        elif model_name == 'mobilenet_v3_small':
            # Even more memory efficient
            model = models.mobilenet_v3_small(pretrained=True)
            # Remove classifier to get features
            feature_extractor = torch.nn.Sequential(*list(model.features))
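            self.feature_dim = 576  # MobileNetV3-Small produces 576-dim features
        
        else:
            # Fail early instead of hitting an undefined feature_extractor below
            raise ValueError(f"Unsupported model_name: {model_name!r}")
        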
        # Freeze the feature extractor
        for param in feature_extractor.parameters():
            param.requires_grad = False
            
        # Move to device
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        feature_extractor = feature_extractor.to(device)
        feature_extractor.eval()
        
        self.feature_extractor = feature_extractor
        
        print(f"Feature extractor created using {model_name}, output dimension: {self.feature_dim}")
        return feature_extractor
    
    def extract_and_save_features(self, split='both', force_recompute=False):
        """Extract features from images and save to disk to save memory during training"""
        import torch
        from torch.utils.data import DataLoader
        import os
        import time
        import numpy as np
        
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # Check if feature extractor exists
        if self.feature_extractor is None:
            print("Feature extractor not found, creating one...")
            self.create_feature_extractor()
        
        # Define function to extract features from a dataset
        def extract_from_dataset(dataset, name):
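            # Check if features already exist. The cache is keyed only by split name,
            # so pass force_recompute=True after changing the dataset, the sample
            # limit, or the feature extractor.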
            features_file = os.path.join(self.feature_save_dir, f"{name}_features.pt")
            labels_file = os.path.join(self.feature_save_dir, f"{name}_labels.pt")
            
            if os.path.exists(features_file) and os.path.exists(labels_file) and not force_recompute:
                print(f"Loading pre-computed {name} features...")
                features = torch.load(features_file)
                labels = torch.load(labels_file)
                print(f"Loaded features shape: {features.shape}")
                return features, labels
            
            # Create dataloader
            loader = DataLoader(
                dataset, 
                batch_size=self.batch_size,
                shuffle=False,
                num_workers=self.num_workers,
                pin_memory=False
            )
            
            # Initialize storage
            all_features = []
            all_labels = []
            
            # Extract features
            print(f"Extracting features from {len(dataset)} {name} images...")
            start_time = time.time()
            
            with torch.no_grad():
                for i, (images, labels) in enumerate(loader):
                    if i % 5 == 0:
                        print(f"Processing batch {i+1}/{len(loader)}...")
                    
                    # Move to device
                    images = images.to(device)
                    
                    # Extract features
                    features = self.feature_extractor(images)
                    
                    # Global average pooling
                    features = torch.nn.functional.adaptive_avg_pool2d(features, (1, 1))
                    features = features.view(features.size(0), -1)
                    
                    # Store
                    all_features.append(features.cpu())
                    all_labels.append(labels)
                    
                    # Clean up
                    del images, features
                    torch.cuda.empty_cache()
            
            # Concatenate
            all_features = torch.cat(all_features, dim=0)
            all_labels = torch.cat(all_labels, dim=0)
            
            # Save to disk
            torch.save(all_features, features_file)
            torch.save(all_labels, labels_file)
            
            print(f"{name} features extracted in {time.time() - start_time:.2f}s")
            print(f"Feature shape: {all_features.shape}")
            print(f"Features saved to {features_file}")
            
            return all_features, all_labels
        
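        # Validate the request before doing any work
        if split not in ('both', 'train', 'test'):
            raise ValueError(f"split must be 'both', 'train', or 'test', got {split!r}")
        if self.train_dataset is None or self.test_dataset is None:
            raise RuntimeError("Datasets not prepared; call setup_data() first")
        
        # Extract features based on specified split
        if split == 'both' or split == 'train':
            self.train_features, self.train_labels = extract_from_dataset(self.train_dataset, 'train')
            
        if split == 'both' or split == 'test':
            self.test_features, self.test_labels = extract_from_dataset(self.test_dataset, 'test')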
            
        return self.train_features, self.train_labels, self.test_features, self.test_labels
    
    def create_feature_dataloaders(self):
        """Create dataloaders from the extracted features"""
        from torch.utils.data import TensorDataset, DataLoader
        
        # Check if features are extracted
        if self.train_features is None or self.test_features is None:
            self.extract_and_save_features()
        
        # Create TensorDatasets
        train_dataset = TensorDataset(self.train_features, self.train_labels)
        test_dataset = TensorDataset(self.test_features, self.test_labels)
        
        # Create DataLoaders
        train_loader = DataLoader(
            train_dataset,
            batch_size=self.batch_size * 4,  # Can use larger batch size with features
            shuffle=True,
            num_workers=self.num_workers,
            pin_memory=False
        )
        
        test_loader = DataLoader(
            test_dataset,
            batch_size=self.batch_size * 4,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=False
        )
        
        print(f"Feature DataLoaders created:")
        print(f"- Train: {len(train_loader)} batches")
        print(f"- Test: {len(test_loader)} batches")
        
        return train_loader, test_loader
    
    def create_feature_classifier(self, hidden_size=128, binary=True):
        """Create a binary neural network to classify the extracted features"""
        # Create a small BNN for feature classification
        if binary:
            # Binary Neural Network
            class FeatureBNN(torch.nn.Module):
                def __init__(self, feature_dim, hidden_size, num_classes):
                    super(FeatureBNN, self).__init__()
                    self.fc1 = BinaryLinear(feature_dim, hidden_size)
                    self.bn1 = torch.nn.BatchNorm1d(hidden_size)
                    self.fc2 = torch.nn.Linear(hidden_size, num_classes)
                    
                def forward(self, x):
                    x = self.fc1(x)
                    x = binary_activation(x)
                    x = self.bn1(x)
                    x = self.fc2(x)
                    return x
                
            model = FeatureBNN(self.feature_dim, hidden_size, len(self.class_names))
            print("Binary Neural Network classifier created for features")
        else:
            # Standard Neural Network
            model = torch.nn.Sequential(
                torch.nn.Linear(self.feature_dim, hidden_size),
                torch.nn.ReLU(),
                torch.nn.BatchNorm1d(hidden_size),
                torch.nn.Dropout(0.5),
                torch.nn.Linear(hidden_size, len(self.class_names))
            )
            print("Standard Neural Network classifier created for features")
        
        # Move to device
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model = model.to(device)
        
        return model
    
    def train_feature_classifier(self, model, train_loader, test_loader, num_epochs=20, lr=0.001):
        """Train a classifier on the extracted features"""
        import torch.optim as optim
        from torch.optim.lr_scheduler import ReduceLROnPlateau
        
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # Loss and optimizer
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=lr)
        scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)
        
        # Training loop
        print(f"Training feature classifier for {num_epochs} epochs...")
        history = []
        
        for epoch in range(num_epochs):
            model.train()
            running_loss = 0.0
            correct = 0
            total = 0
            
            for i, (features, labels) in enumerate(train_loader):
                features, labels = features.to(device), labels.to(device)
                
                # Forward pass
                outputs = model(features)
                loss = criterion(outputs, labels)
                
                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                
                # Track metrics
                running_loss += loss.item()
                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()
            
            # Calculate epoch metrics
            train_loss = running_loss / len(train_loader)
            train_acc = 100.0 * correct / total
            
            # Evaluate
            model.eval()
            test_loss = 0.0
            test_correct = 0
            test_total = 0
            
            with torch.no_grad():
                for features, labels in test_loader:
                    features, labels = features.to(device), labels.to(device)
                    outputs = model(features)
                    loss = criterion(outputs, labels)
                    
                    test_loss += loss.item()
                    _, predicted = outputs.max(1)
                    test_total += labels.size(0)
                    test_correct += predicted.eq(labels).sum().item()
            
            # Calculate test metrics
            test_loss = test_loss / len(test_loader)
            test_acc = 100.0 * test_correct / test_total
            
            # Update scheduler
            scheduler.step(test_loss)
            
            # Record history
            history.append({
                'epoch': epoch + 1,
                'train_loss': train_loss,
                'train_acc': train_acc,
                'test_loss': test_loss,
                'test_acc': test_acc,
                'lr': optimizer.param_groups[0]['lr']
            })
            
            # Print progress
            print(f"Epoch {epoch+1}/{num_epochs} | "
                  f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | "
                  f"Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}%")
        
        print("Training complete!")
        return history

# Example usage of feature extraction pipeline:
# pipeline = FeatureExtractionPipeline(dataset_path)
# pipeline.setup_data(max_samples_per_class=50)
# pipeline.create_feature_extractor('mobilenet_v3_small')
# pipeline.extract_and_save_features()
# train_loader, test_loader = pipeline.create_feature_dataloaders()
# model = pipeline.create_feature_classifier()
# history = pipeline.train_feature_classifier(model, train_loader, test_loader)

print("Feature extraction pipeline loaded and ready to use")

In [None]:
# Ultimate Memory-Optimized Training Recipe for 224x224 Images
# This cell provides a complete solution for training BNNs with 224x224 images
# with the most advanced memory optimization techniques

def train_memory_optimized_bnn_224x224(num_samples_per_class=50, 
                                      num_epochs=15, 
                                      optimization_level='extreme'):
    """
    Complete recipe for training a memory-optimized BNN on 224x224 images.
    
    Parameters:
    -----------
    num_samples_per_class : int
        Number of samples per class to use (limit dataset size)
    num_epochs : int
        Number of training epochs
    optimization_level : str
        'moderate', 'high', or 'extreme' memory optimization
        
    Returns:
    --------
    model : trained model
    training_results : dict returned by train_memory_efficient
        (includes 'final_loss' and 'final_accuracy')
    eval_results : evaluation results
    """
    import time
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, random_split
    import os
    import gc
    
    print(f"\n{'='*80}")
    print(f"ULTIMATE MEMORY-OPTIMIZED BNN TRAINING FOR 224x224 IMAGES")
    print(f"Optimization level: {optimization_level}")
    print(f"{'='*80}\n")
    
    # 1. Apply system-level memory optimization
    print("\n[1/7] Applying system-level memory optimization...")
    
    # Create memory optimizer
    memory_optimizer = AdvancedMemoryOptimizer()
    
    # Configure PyTorch memory settings based on optimization level
    if optimization_level == 'extreme':
        memory_fraction = 0.7
        block_size = 32
    elif optimization_level == 'high':
        memory_fraction = 0.8
        block_size = 16
    else:  # moderate
        memory_fraction = 0.9
        block_size = 32
        
    memory_optimizer.configure_pytorch_memory(fraction=memory_fraction, min_block_size=block_size)
    
    # Make sure device is set correctly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    
    # 2. Configure memory settings
    print("\n[2/7] Configuring memory settings...")
    
    # Set optimization parameters based on level
    if optimization_level == 'extreme':
        memory_config_224 = {
            'image_size': 224,
            'batch_size': 1,
            'gradient_accumulation_steps': 32,
            'gc_frequency': 1,
            'memory_efficient': True,
            'use_mixed_precision': True,
            'max_samples_per_class': num_samples_per_class,
            'hidden_size': 64,
            'embedding_size': 128,
            'num_workers': 0,
            'pin_memory': False,
        }
    elif optimization_level == 'high':
        memory_config_224 = {
            'image_size': 224,
            'batch_size': 2,
            'gradient_accumulation_steps': 16,
            'gc_frequency': 1,
            'memory_efficient': True,
            'use_mixed_precision': True,
            'max_samples_per_class': num_samples_per_class,
            'hidden_size': 128,
            'embedding_size': 256,
            'num_workers': 0,
            'pin_memory': False,
        }
    else:  # moderate
        memory_config_224 = {
            'image_size': 224,
            'batch_size': 4,
            'gradient_accumulation_steps': 8,
            'gc_frequency': 2,
            'memory_efficient': True,
            'use_mixed_precision': True,
            'max_samples_per_class': num_samples_per_class,
            'hidden_size': 256,
            'embedding_size': 512,
            'num_workers': 0,
            'pin_memory': False,
        }
    
    # Update global memory config
    memory_config.update(memory_config_224)
    
    # Extract key settings
    batch_size = memory_config['batch_size']
    gradient_accumulation_steps = memory_config['gradient_accumulation_steps']
    use_mixed_precision = memory_config['use_mixed_precision']
    gc_frequency = memory_config['gc_frequency']
    image_size = memory_config['image_size']
    hidden_size = memory_config['hidden_size']
    embedding_size = memory_config['embedding_size']
    
    print(f"Memory configuration:")
    print(f"- Image size: {image_size}x{image_size}")
    print(f"- Batch size: {batch_size}")
    print(f"- Gradient accumulation steps: {gradient_accumulation_steps}")
    print(f"- Effective batch size: {batch_size * gradient_accumulation_steps}")
    print(f"- Mixed precision: {use_mixed_precision}")
    print(f"- GC frequency: {gc_frequency}")
    print(f"- Max samples per class: {num_samples_per_class}")
    
    # 3. Prepare dataset
    print("\n[3/7] Preparing dataset...")
    
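    # Reference preprocessing for 224x224 inputs (ImageNet mean/std).
    # Note: this transform is not passed to load_plant_disease_dataset below,
    # which receives image_size directly.
    from torchvision import transforms
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])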
    
    # Load dataset
    if 'dataset_path' not in globals():
        raise ValueError("dataset_path not defined. Please set your dataset path first.")
    
    print(f"Loading dataset from {dataset_path}...")
    dataset, class_names = load_plant_disease_dataset(
        dataset_path,
        image_size=image_size,
        memory_efficient=True,
        max_samples_per_class=num_samples_per_class
    )
    
    # Split dataset
    train_size = int(0.8 * len(dataset))
    test_size = len(dataset) - train_size
    train_dataset, test_dataset = random_split(dataset, [train_size, test_size])
    
    # Create data loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=0,
        pin_memory=False
    )
    
    test_loader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=0,
        pin_memory=False
    )
    
    print(f"Dataset prepared:")
    print(f"- Training samples: {len(train_dataset)}")
    print(f"- Test samples: {len(test_dataset)}")
    print(f"- Classes: {class_names}")
    
    # 4. Create model with maximum memory efficiency
    print("\n[4/7] Creating memory-optimized model...")
    
    # Choose model based on optimization level
    if optimization_level == 'extreme':
        print("Using ultra memory-optimized ConvBNN model for 224x224 images")
        model = ConvBNN224x224(
            num_classes=len(class_names),
            dropout_rate=0.5
        ).to(device)
    else:
        print("Using memory-optimized BNN224x224MemoryOptimized model")
        model = BNN224x224MemoryOptimized(
            input_size=3*image_size*image_size,
            hidden_size=hidden_size,
            num_classes=len(class_names),
            dropout_rate=0.3
        ).to(device)
    
    # Apply tensor compression if in extreme mode
    if optimization_level == 'extreme':
        model = memory_optimizer.apply_tensor_compression(model)
    
    # 5. Configure training
    print("\n[5/7] Configuring training...")
    
    # Loss function
    criterion = nn.CrossEntropyLoss()
    
    # Optimizer with weight decay for regularization
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    
    # Learning rate scheduler
    # (the `verbose` argument is omitted because it is deprecated in recent PyTorch releases)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=2
    )
    
    print(f"Training configuration:")
    print(f"- Epochs: {num_epochs}")
    print(f"- Optimizer: Adam(lr=0.001, weight_decay=1e-5)")
    print(f"- Scheduler: ReduceLROnPlateau(factor=0.5, patience=2)")
    
    # 6. Train the model
    print("\n[6/7] Training model...")
    start_time = time.time()
    
    # Use memory-efficient training function
    training_results = train_memory_efficient(
        model=model,
        train_loader=train_loader,
        criterion=criterion,
        optimizer=optimizer,
        num_epochs=num_epochs,
        device=device,
        scheduler=scheduler,
        gradient_accumulation_steps=gradient_accumulation_steps,
        memory_efficient=True,
        gc_frequency=gc_frequency,
        use_mixed_precision=use_mixed_precision,
        early_stopping_patience=5
    )
    
    # Calculate training time
    train_time = time.time() - start_time
    print(f"Training completed in {train_time:.2f} seconds")
    print(f"Final loss: {training_results['final_loss']:.4f}")
    print(f"Final accuracy: {training_results['final_accuracy']:.2f}%")
    
    # 7. Evaluate the model
    print("\n[7/7] Evaluating model...")
    
    # Use memory-efficient evaluation function
    eval_results = evaluate_bnn_memory_efficient(
        model=model,
        test_loader=test_loader,
        criterion=criterion,
        device=device,
        class_names=class_names,
        use_mixed_precision=use_mixed_precision
    )
    
    print(f"Evaluation results:")
    print(f"- Test loss: {eval_results['loss']:.4f}")
    print(f"- Test accuracy: {eval_results['accuracy']:.2f}%")
    print(f"- Macro F1 score: {eval_results['macro_f1']:.4f}")
    
    # Per-class accuracy
    print("\nPer-class accuracy:")
    for class_name, acc in eval_results['class_accuracy'].items():
        print(f"- {class_name}: {acc:.2f}%")
    
    # 8. Save the model
    print("\nSaving model...")
    save_dir = 'results'
    os.makedirs(save_dir, exist_ok=True)
    
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    save_path = f"{save_dir}/bnn_224x224_{timestamp}.pth"
    
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'class_names': class_names,
        'config': memory_config,
        'training_results': training_results,
        'eval_results': {k: v for k, v in eval_results.items() if not isinstance(v, torch.Tensor)}
    }, save_path)
    
    print(f"Model saved to {save_path}")
    
    print(f"\n{'='*80}")
    print(f"TRAINING COMPLETE")
    print(f"Final test accuracy: {eval_results['accuracy']:.2f}%")
    print(f"{'='*80}\n")
    
    return model, training_results, eval_results

# To use this ultimate training recipe, uncomment the following line:
# model, training_results, eval_results = train_memory_optimized_bnn_224x224(
#     num_samples_per_class=50,  # Adjust based on available memory
#     num_epochs=15,
#     optimization_level='extreme'  # 'moderate', 'high', or 'extreme'
# )

print("Ultimate memory-optimized training recipe loaded and ready to use")

# Comprehensive Memory Optimization Guide for 224x224 Images

This guide summarizes the memory optimization techniques implemented in this notebook for training Binary Neural Networks on 224x224 images with limited GPU memory.

## Optimization Categories

### A. Data Pipeline Optimizations

1. **Batch Size Reduction** 
   - Use batch_size=1 with gradient accumulation to simulate larger batches
   - Effective batch size = batch_size × gradient_accumulation_steps
   
2. **Limited Dataset Size**
   - Use max_samples_per_class to limit number of images loaded
   - Preserve class balance while reducing memory footprint

3. **Memory-Efficient Data Loading**
   - No worker processes (num_workers=0), so batches are loaded in the main process
   - pin_memory disabled to avoid allocating page-locked host (CPU) memory
   - On-demand loading instead of preloading

4. **Patch-Based Processing**
   - Process 224×224 images as smaller patches (e.g., 112×112)
   - Combine patch predictions for full-image classification
   - Available via `create_patch_dataset_loader()` function

5. **Feature Pre-extraction**
   - Use lightweight pre-trained model to extract features
   - Train BNN on extracted features, not raw images
   - See `FeatureExtractionPipeline` class
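
The batch-size item above relies on gradient accumulation. The following is a minimal sketch of the idea on a tiny stand-in model with synthetic data; it is not the notebook's `train_memory_efficient` implementation, and the model, data, and step count are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model and synthetic data (assumptions for illustration only)
model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 8)
targets = torch.randint(0, 3, (8,))

accumulation_steps = 4  # effective batch size = 1 * 4
optimizer_steps = 0

optimizer.zero_grad()
for i in range(len(inputs)):
    x = inputs[i:i + 1]   # micro-batch of size 1
    y = targets[i:i + 1]
    loss = criterion(model(x), y)
    # Divide so the accumulated gradient equals the mean over the effective batch
    (loss / accumulation_steps).backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()        # one weight update per effective batch
        optimizer.zero_grad()
        optimizer_steps += 1

print(f"optimizer steps: {optimizer_steps}")
```

Only one micro-batch of activations is alive at a time, while the weights are updated as if a batch of `accumulation_steps` samples had been processed. Layers such as BatchNorm still see the micro-batch statistics, so results are not always identical to true large-batch training.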

### B. Model Architecture Optimizations

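1. **Convolutional Feature Extraction**
   - Replace flattened input (150,528 dims = 3×224×224) with conv layers
   - Cuts first-layer parameters by well over 99%: a dense layer needs 150,528 weights per hidden unit, while a 3×3 conv over 3 input channels needs 27 weights per filter
   - Implemented in `ConvBNN224x224` and `BNN224x224MemoryOptimized` classes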

2. **Progressive Dimensionality Reduction**
   - Use multiple smaller steps to reduce dimensions
   - Avoids massive dense layers that consume memory
   - More aggressive reduction for early layers

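3. **Binary Weights and Activations**
   - 1-bit weights and activations to reduce model size
   - Memory-efficient binary operations
   - BNN training typically keeps full-precision latent weights, so the 1-bit saving applies mainly to the stored or deployed model
   - Use the `BinaryLinear` layer and the `binary_activation` function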

4. **Depthwise Separable Convolutions**
   - Replace standard convolutions for efficiency
   - Much lower parameter count while preserving performance
   - Implemented in `ConvBNN224x224` model

5. **Memory-Efficient Architectures**
   - Multiple architecture options based on memory constraints
   - MobileNetV3-Small for feature extraction
   - Minimal fully-connected layers
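
The parameter savings claimed in this section can be checked with simple arithmetic. The sketch below counts parameters for a dense layer on the flattened input, a first convolution, and a standard versus depthwise separable convolution. The layer sizes (512 hidden units, 32 filters, 64 to 128 channels) are hypothetical values chosen for illustration; they are not taken from `ConvBNN224x224`.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a standard 2D convolution."""
    return in_ch * out_ch * kernel * kernel + out_ch


def depthwise_separable_params(in_ch, out_ch, kernel):
    """Depthwise conv (one filter per input channel) plus a 1x1 pointwise conv."""
    depthwise = in_ch * kernel * kernel + in_ch
    pointwise = in_ch * out_ch + out_ch
    return depthwise + pointwise


flat_inputs = 3 * 224 * 224                    # size of a flattened RGB image
dense_first = dense_params(flat_inputs, 512)   # hypothetical 512-unit first layer
conv_first = conv_params(3, 32, 3)             # hypothetical 3x3 conv with 32 filters

standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)

print(f"flattened inputs:            {flat_inputs:,}")   # 150,528
print(f"dense first layer:           {dense_first:,}")   # 77,070,848
print(f"conv first layer:            {conv_first:,}")    # 896
print(f"standard 3x3 conv 64->128:   {standard:,}")      # 73,856
print(f"separable 3x3 conv 64->128:  {separable:,}")     # 8,960
```

Parameter count is only part of the picture: convolutional feature maps at full 224×224 resolution can take more memory than the weights themselves, which is why early downsampling matters as well.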

### C. Training Process Optimizations

1. **Gradient Checkpointing**
   - Discard intermediate activations during forward pass
   - Recompute them during backward pass
   - Trades computation for memory savings
   - Implemented in `BNN224x224MemoryOptimized` model

2. **Mixed Precision Training**
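   - Run most operations in FP16 via autocast
   - Use GradScaler to scale the loss so that small FP16 gradients do not underflow
   - Can reduce activation memory by up to about 50%; parameters stay in FP32 under autocast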

3. **Aggressive Garbage Collection**
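   - Release PyTorch's cached GPU memory after each batch with `torch.cuda.empty_cache()`; this mainly reduces fragmentation and does not give PyTorch itself more memory
   - Run Python's garbage collector (`gc.collect()`) frequently
   - Delete unused tensors explicitly with `del`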

4. **Layer Freezing**
   - Freeze early layers to reduce gradient memory
   - Train only the final layers
   - See `freeze_early_layers()` method

5. **Tensor Compression**
   - Use half-precision weights where possible
   - Keep critical layers in full precision
   - See `apply_tensor_compression()` method
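
A single training step that combines mixed precision with explicit cleanup might look like the sketch below. The small convolutional model and random batch are stand-ins (assumptions), and autocast is enabled only when a GPU is present, so the same code also runs on CPU in full precision. `torch.cuda.amp.GradScaler` is used here for compatibility with older releases; recent PyTorch versions also expose it as `torch.amp.GradScaler`.

```python
import gc

import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # mixed precision only when a GPU is available

# Stand-in model and batch (assumptions for illustration).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(2, 3, 224, 224, device=device)
y = torch.randint(0, 4, (2,), device=device)

optimizer.zero_grad(set_to_none=True)
# On CUDA, autocast defaults to float16 for eligible operations.
with torch.autocast(device_type=device.type, enabled=use_amp):
    loss = criterion(model(x), y)

scaler.scale(loss).backward()  # scale the loss so small FP16 gradients do not underflow
scaler.step(optimizer)         # unscale gradients; skip the step if they contain inf/NaN
scaler.update()                # adjust the scale factor for the next iteration

loss_value = loss.item()       # keep a Python float instead of the tensor
del x, y, loss                 # drop references so the tensors can be freed
gc.collect()
if device.type == "cuda":
    torch.cuda.empty_cache()   # return cached, unused blocks to the driver

print(f"loss: {loss_value:.4f}")
```

Calling `empty_cache()` after every batch costs time because PyTorch must request the memory again on the next iteration, so it is worth measuring whether it actually prevents out-of-memory errors on your hardware.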

### D. System-Level Optimizations

1. **CUDA Memory Configuration**
   - Set the `PYTORCH_CUDA_ALLOC_CONF` environment variable before CUDA is initialized to tune the caching allocator
   - Limit memory fragmentation with `max_split_size_mb`
   - Modify garbage collection threshold

2. **Memory Limits**
   - Set a per-process memory fraction to cap how much GPU memory PyTorch may allocate; exceeding the cap still raises an out-of-memory error
   - Leave the remaining memory for the system and other processes
   - See `configure_pytorch_memory()` method

3. **cuDNN Settings**
   - Disable benchmarking (`torch.backends.cudnn.benchmark = False`) for more predictable memory usage
   - Enable deterministic algorithms (`torch.backends.cudnn.deterministic = True`)
   - Trade speed for reliability
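
The system-level settings above can be sketched as follows. The allocator values and the `configure_memory` helper are hypothetical and are not the notebook's `configure_pytorch_memory()` implementation. The sketch also assumes that the garbage collection threshold mentioned above refers to the allocator's `garbage_collection_threshold` option.

```python
import os

# Allocator options must be set before the first CUDA allocation,
# ideally before torch is imported. The values are illustrative assumptions.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:128,garbage_collection_threshold:0.6"
)

import torch


def configure_memory(fraction=0.7):
    """Hypothetical helper: cap this process's GPU memory and fix cuDNN behavior."""
    if torch.cuda.is_available():
        # Allocations beyond `fraction` of the device's memory raise an OOM error.
        torch.cuda.set_per_process_memory_fraction(fraction)
    torch.backends.cudnn.benchmark = False     # skip algorithm autotuning
    torch.backends.cudnn.deterministic = True  # use deterministic conv algorithms
    return torch.cuda.is_available()


cuda_configured = configure_memory(fraction=0.7)
print(f"CUDA configured: {cuda_configured}")
```

In a notebook where `torch` has already initialized CUDA, changing the environment variable has no effect until the kernel is restarted.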

## Choosing the Right Strategy

1. **For Extreme Memory Constraints:**
   - Use `train_memory_optimized_bnn_224x224()` with 'extreme' optimization level
   - Consider using the feature extraction pipeline
   - Limit samples per class to 10-20

2. **For Moderate Memory Constraints:**
   - Use `train_memory_optimized_bnn_224x224()` with 'moderate' optimization level
   - Use the `BNN224x224MemoryOptimized` model
   - Limit samples per class to 50-100

3. **For Performance with Memory Efficiency:**
   - Use the `ConvBNN224x224` model
   - Enable mixed precision training
   - Use moderate batch size with gradient accumulation

## Recipe for Success

```python
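# AdvancedMemoryOptimizer and train_memory_optimized_bnn_224x224 are defined
# in earlier cells of this notebook; run those cells first.

# 1. Configure memory settings (fraction = share of GPU memory PyTorch may use)
memory_optimizer = AdvancedMemoryOptimizer()
memory_optimizer.configure_pytorch_memory(fraction=0.7)

# 2. Run the full training recipe
model, results, eval_results = train_memory_optimized_bnn_224x224(
    num_samples_per_class=40,  # lower to 10-20 if you still hit out-of-memory errors
    num_epochs=15,
    optimization_level='extreme'
)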
```

Together, these techniques make it feasible to train Binary Neural Networks on 224×224 images on GPUs with limited memory.