# 🧠 Model Training: From Easy to Super Advanced

**A Complete Journey Through Computer Vision Model Training**

This notebook takes you from the most basic model training concepts to the cutting-edge, super-advanced techniques used in research and production systems.

## 📚 What You'll Learn

### 🌱 **Beginner Level (Easy)**
- Simple linear classifiers
- Basic neural networks with PyTorch
- Transfer learning with pre-trained models
- Basic data augmentation

### 🚀 **Intermediate Level**
- Custom CNN architectures
- Advanced data augmentation
- Learning rate scheduling
- Model ensembling

### ⚡ **Advanced Level**
- Custom loss functions and optimizers
- Multi-GPU training
- Mixed precision training
- Model compression and pruning

### 🔬 **Super Advanced Level (Research-Grade)**
- Neural Architecture Search (NAS)
- Meta-learning and few-shot learning
- Self-supervised learning
- Adversarial training and robustness
- Federated learning
- Gradient accumulation and large batch training
- Custom CUDA kernels integration
- Distributed training across multiple nodes

---

**⚠️ Requirements:**
- PyTorch 2.3+ (the setup cell imports `GradScaler` from `torch.amp`, which earlier 2.x releases do not export)
- CUDA-capable GPU (recommended)
- 16GB+ RAM for advanced sections
- Multiple GPUs for super-advanced distributed training

In [None]:
# 🔧 Complete Setup: All Libraries for Easy to Super Advanced Training
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, random_split
from torch.amp import autocast, GradScaler  # Mixed precision
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import torchvision
import torchvision.transforms as transforms
from torchvision.models import *
import torchvision.datasets as datasets

# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import cv2
from PIL import Image
import os
import time
import random
from pathlib import Path
import json
import pickle
from tqdm import tqdm

# Advanced libraries
import timm  # PyTorch Image Models - state-of-the-art architectures
import albumentations as A  # Advanced augmentations
from albumentations.pytorch import ToTensorV2
import optuna  # Hyperparameter optimization
import wandb  # Experiment tracking

# Super advanced libraries
try:
    import pytorch_lightning as pl  # High-level PyTorch wrapper
    from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
    from pytorch_lightning.loggers import WandbLogger
except ImportError:
    print("⚠️  PyTorch Lightning not installed - some super advanced features will be unavailable")

try:
    import higher  # Meta-learning
except ImportError:
    print("⚠️  Higher not installed - meta-learning examples will use manual implementation")

try:
    from nni.compression.pytorch import ModelSpeedup
    from nni.compression.pytorch.pruning import L1FilterPruner
except ImportError:
    print("⚠️  NNI not installed - model compression examples will use manual implementation")

# Scientific computing
import scipy
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Set device and check capabilities
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️  Device: {device}")

if torch.cuda.is_available():
    print(f"🔥 GPU: {torch.cuda.get_device_name()}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory // 1024**3} GB")
    print(f"🔢 CUDA Compute Capability: {torch.cuda.get_device_capability()}")
    print(f"🚀 BF16 Support: {torch.cuda.is_bf16_supported()}")

# Multi-GPU setup
if torch.cuda.device_count() > 1:
    print(f"🔥 Multiple GPUs detected: {torch.cuda.device_count()}")
    print("✅ Ready for advanced multi-GPU training")

# Set random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
print("✅ All libraries imported and environment configured!")
print("🎯 Ready for Easy to Super Advanced model training!")

# 🌱 Level 1: Easy - Basic Model Training

Let's start with the absolute basics. We'll create simple datasets and train basic models.
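
Simple linear classifiers are the first item on the beginner list. As a rough sketch of the idea, independent of the `EasyTraining` class below, the following trains a logistic-regression classifier with plain gradient descent on a tiny, made-up 2D dataset. The data points, the learning rate, and the function names (`train_linear_classifier`, `predict`) are illustrative assumptions, not something the notebook itself uses.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_classifier(points, labels, lr=0.5, epochs=200):
    """Fit w and b of p(y=1 | x) = sigmoid(w . x + b) by batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), y in zip(points, labels):
            p = sigmoid(w[0] * x1 + w[1] * x2 + b)
            err = p - y  # gradient of binary cross-entropy w.r.t. the score
            grad_w[0] += err * x1 / n
            grad_w[1] += err * x2 / n
            grad_b += err / n
        w[0] -= lr * grad_w[0]
        w[1] -= lr * grad_w[1]
        b -= lr * grad_b
    return w, b

def predict(w, b, point):
    """Predict class 1 when the linear score is positive, else class 0."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else 0

# Hypothetical, linearly separable toy data:
# class 0 sits in the lower-left, class 1 in the upper-right.
points = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -1.5),
          (1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_linear_classifier(points, labels)
predictions = [predict(w, b, p) for p in points]
accuracy = sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)
print(f"weights={w}, bias={b:.3f}, accuracy={accuracy:.2f}")
```

A single linear boundary like this cannot separate a ring from a line through its middle, which is why the dataset in this level is paired with a small multi-layer network instead.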

In [None]:
class EasyTraining:
    """
    Level 1: Easy model training for beginners
    """
    
    def __init__(self):
        self.device = device
        
    def create_simple_dataset(self):
        """Create a simple synthetic dataset for learning"""
        print("📊 Creating Simple Synthetic Dataset")
        print("=" * 38)
        
        # Create simple 2D classification data
        np.random.seed(42)
        n_samples = 1000
        
        # Class 0: noisy ring (radius between 1 and 3)
        angles = np.random.uniform(0, 2*np.pi, n_samples//2)
        radius = np.random.uniform(1, 3, n_samples//2)
        x1 = radius * np.cos(angles) + np.random.normal(0, 0.3, n_samples//2)
        y1 = radius * np.sin(angles) + np.random.normal(0, 0.3, n_samples//2)
        
        # Class 1: diagonal line
        x2 = np.random.uniform(-5, 5, n_samples//2)
        y2 = x2 + np.random.normal(0, 0.5, n_samples//2)
        
        # Combine data
        X = np.vstack([np.column_stack([x1, y1]), np.column_stack([x2, y2])])
        y = np.hstack([np.zeros(n_samples//2), np.ones(n_samples//2)])
        
        # Visualize
        plt.figure(figsize=(10, 6))
        scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.7)
        plt.colorbar(scatter)
        plt.title('Simple 2D Classification Dataset')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
        plt.show()
        
        # Convert to PyTorch tensors
        X_tensor = torch.FloatTensor(X).to(self.device)
        y_tensor = torch.LongTensor(y).to(self.device)
        
        print(f"✅ Dataset created: {X.shape[0]} samples, {X.shape[1]} features")
        return X_tensor, y_tensor
    
    def simple_neural_network(self, input_size=2, hidden_size=64, num_classes=2):
        """Create a simple feedforward neural network"""
        print("🧠 Creating Simple Neural Network")
        print("=" * 32)
        
        class SimpleNN(nn.Module):
            def __init__(self, input_size, hidden_size, num_classes):
                super(SimpleNN, self).__init__()
                self.fc1 = nn.Linear(input_size, hidden_size)
                self.relu = nn.ReLU()
                self.fc2 = nn.Linear(hidden_size, hidden_size)
                self.fc3 = nn.Linear(hidden_size, num_classes)
                self.dropout = nn.Dropout(0.2)
                
            def forward(self, x):
                x = self.fc1(x)
                x = self.relu(x)
                x = self.dropout(x)
                x = self.fc2(x)
                x = self.relu(x)
                x = self.dropout(x)
                x = self.fc3(x)
                return x
        
        model = SimpleNN(input_size, hidden_size, num_classes).to(self.device)
        
        # Print model architecture
        print(f"Model Architecture:")
        print(model)
        
        # Count parameters
        total_params = sum(p.numel() for p in model.parameters())
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print(f"\n📊 Total parameters: {total_params:,}")
        print(f"📊 Trainable parameters: {trainable_params:,}")
        
        return model
    
    def basic_training_loop(self, model, X, y, epochs=100, lr=0.001):
        """Basic training loop with visualization"""
        print("🏃 Starting Basic Training Loop")
        print("=" * 30)
        
        # Split data
        n_train = int(0.8 * len(X))
        indices = torch.randperm(len(X))
        train_indices, val_indices = indices[:n_train], indices[n_train:]
        
        X_train, X_val = X[train_indices], X[val_indices]
        y_train, y_val = y[train_indices], y[val_indices]
        
        # Loss and optimizer
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=lr)
        
        # Training history
        train_losses = []
        val_losses = []
        train_accs = []
        val_accs = []
        
        print(f"Training on {len(X_train)} samples, validating on {len(X_val)} samples")
        
        # Training loop
        for epoch in range(epochs):
            # Training
            model.train()
            optimizer.zero_grad()
            
            outputs = model(X_train)
            loss = criterion(outputs, y_train)
            loss.backward()
            optimizer.step()
            
            # Calculate training accuracy
            predicted = outputs.argmax(dim=1)
            train_acc = (predicted == y_train).float().mean().item()
            
            # Validation
            model.eval()
            with torch.no_grad():
                val_outputs = model(X_val)
                val_loss = criterion(val_outputs, y_val)
                val_predicted = val_outputs.argmax(dim=1)
                val_acc = (val_predicted == y_val).float().mean().item()
            
            # Store history
            train_losses.append(loss.item())
            val_losses.append(val_loss.item())
            train_accs.append(train_acc)
            val_accs.append(val_acc)
            
            # Print progress
            if (epoch + 1) % 20 == 0:
                print(f"Epoch [{epoch+1}/{epochs}] - "
                      f"Train Loss: {loss.item():.4f}, Train Acc: {train_acc:.4f} - "
                      f"Val Loss: {val_loss.item():.4f}, Val Acc: {val_acc:.4f}")
        
        # Plot training history
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
        
        # Loss plot
        ax1.plot(train_losses, label='Training Loss', color='blue')
        ax1.plot(val_losses, label='Validation Loss', color='red')
        ax1.set_xlabel('Epoch')
        ax1.set_ylabel('Loss')
        ax1.set_title('Training and Validation Loss')
        ax1.legend()
        ax1.grid(True)
        
        # Accuracy plot
        ax2.plot(train_accs, label='Training Accuracy', color='blue')
        ax2.plot(val_accs, label='Validation Accuracy', color='red')
        ax2.set_xlabel('Epoch')
        ax2.set_ylabel('Accuracy')
        ax2.set_title('Training and Validation Accuracy')
        ax2.legend()
        ax2.grid(True)
        
        plt.tight_layout()
        plt.show()
        
        print(f"\n✅ Training completed!")
        print(f"📊 Final Training Accuracy: {train_accs[-1]:.4f}")
        print(f"📊 Final Validation Accuracy: {val_accs[-1]:.4f}")
        
        return model, (train_losses, val_losses, train_accs, val_accs)

# Create instance
easy_trainer = EasyTraining()
print("🌱 Easy Training module initialized!")
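
`SimpleNN.forward` returns raw scores (logits) with no softmax layer, because `nn.CrossEntropyLoss` applies log-softmax internally. The snippet below is a small plain-Python sketch of that computation for a single sample; the helper name `cross_entropy_from_logits` and the example logits are made up for illustration.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    # Subtract the max logit first so exp() cannot overflow (log-sum-exp trick).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits: the model is undecided between the two classes.
uniform_loss = cross_entropy_from_logits([0.0, 0.0], target=0)
# Strong preference for class 0, evaluated against each possible label.
confident_right = cross_entropy_from_logits([4.0, -4.0], target=0)
confident_wrong = cross_entropy_from_logits([4.0, -4.0], target=1)

print(uniform_loss, confident_right, confident_wrong)
```

An undecided two-class model scores ln 2 (about 0.693), so a first-epoch loss in that neighbourhood is a reasonable sanity check for the training loop above.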

In [None]:
# 🎯 RUN EASY TRAINING DEMO
print("🚀 RUNNING EASY TRAINING DEMONSTRATION")
print("=" * 40)

# Create dataset
X, y = easy_trainer.create_simple_dataset()

# Create model
simple_model = easy_trainer.simple_neural_network()

# Train model
trained_model, history = easy_trainer.basic_training_loop(simple_model, X, y, epochs=100)

print("\n🎉 Easy training demonstration completed!")
print("💡 You've successfully trained your first neural network!")

# 🚀 Level 2: Intermediate - Image Classification with CNNs

Now let's move to image data and convolutional neural networks with more advanced techniques.
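
The custom CNN in this level flattens its last feature map into `256 * 4 * 4` values before the first linear layer. One way to check that number is to trace the spatial size through the network: each 3×3 convolution with `padding=1` keeps the size, and each 2×2 max-pool halves it. The helpers below are a plain-Python sketch of the standard output-size formula; their names are invented for this example.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max-pool without padding."""
    return (size - kernel) // stride + 1

size = 32  # CIFAR-10 images are 32x32
for block in range(3):  # three blocks of conv, conv, pool
    size = conv_out(conv_out(size))
    size = pool_out(size)
    print(f"after block {block + 1}: {size}x{size}")

flattened = 256 * size * size  # 256 channels after the third block
print(f"flattened features: {flattened}")
```

If the input resolution or the number of pooling layers changes, the `in_features` of the first linear layer has to change with it.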

In [None]:
class IntermediateTraining:
    """
    Level 2: Intermediate training with CNNs and real image data
    """
    
    def __init__(self):
        self.device = device
        
    def create_real_world_dataset(self):
        """Create or load a real-world image dataset"""
        print("📸 Creating Real-World Image Dataset")
        print("=" * 37)
        
        # Download CIFAR-10 dataset
        transform_train = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
        ])
        
        transform_test = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
        ])
        
        print("📥 Downloading CIFAR-10 dataset...")
        trainset = torchvision.datasets.CIFAR10(
            root='./data', train=True, download=True, transform=transform_train)
        testset = torchvision.datasets.CIFAR10(
            root='./data', train=False, download=True, transform=transform_test)
        
        # Create data loaders
        train_loader = DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
        test_loader = DataLoader(testset, batch_size=100, shuffle=False, num_workers=2)
        
        # CIFAR-10 classes
        classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
        
        print(f"✅ Dataset loaded: {len(trainset)} training, {len(testset)} test samples")
        print(f"📊 Classes: {classes}")
        
        # Visualize some samples
        self.visualize_dataset(train_loader, classes)
        
        return train_loader, test_loader, classes
    
    def visualize_dataset(self, dataloader, classes):
        """Visualize dataset samples"""
        # Get one batch
        dataiter = iter(dataloader)
        images, labels = next(dataiter)
        
        # Denormalize for visualization
        mean = torch.tensor([0.4914, 0.4822, 0.4465]).view(3, 1, 1)
        std = torch.tensor([0.2023, 0.1994, 0.2010]).view(3, 1, 1)
        images = images * std + mean
        images = torch.clamp(images, 0, 1)
        
        # Plot samples
        fig, axes = plt.subplots(2, 8, figsize=(16, 4))
        for i in range(16):
            row, col = i // 8, i % 8
            axes[row, col].imshow(images[i].permute(1, 2, 0))
            axes[row, col].set_title(f'{classes[labels[i]]}')
            axes[row, col].axis('off')
        
        plt.suptitle('CIFAR-10 Dataset Samples')
        plt.tight_layout()
        plt.show()
    
    def create_custom_cnn(self, num_classes=10):
        """Create a custom CNN architecture"""
        print("🏗️  Creating Custom CNN Architecture")
        print("=" * 35)
        
        class CustomCNN(nn.Module):
            def __init__(self, num_classes=10):
                super(CustomCNN, self).__init__()
                
                # Convolutional layers
                self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
                self.bn1 = nn.BatchNorm2d(64)
                self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
                self.bn2 = nn.BatchNorm2d(64)
                
                self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
                self.bn3 = nn.BatchNorm2d(128)
                self.conv4 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
                self.bn4 = nn.BatchNorm2d(128)
                
                self.conv5 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
                self.bn5 = nn.BatchNorm2d(256)
                self.conv6 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
                self.bn6 = nn.BatchNorm2d(256)
                
                # Pooling and dropout
                self.pool = nn.MaxPool2d(2, 2)
                self.dropout = nn.Dropout(0.5)
                self.dropout_conv = nn.Dropout2d(0.25)
                
                # Fully connected layers
                self.fc1 = nn.Linear(256 * 4 * 4, 512)
                self.fc2 = nn.Linear(512, 256)
                self.fc3 = nn.Linear(256, num_classes)
                
                # Activation functions
                self.relu = nn.ReLU(inplace=True)
                
            def forward(self, x):
                # Block 1
                x = self.relu(self.bn1(self.conv1(x)))
                x = self.relu(self.bn2(self.conv2(x)))
                x = self.pool(x)
                x = self.dropout_conv(x)
                
                # Block 2
                x = self.relu(self.bn3(self.conv3(x)))
                x = self.relu(self.bn4(self.conv4(x)))
                x = self.pool(x)
                x = self.dropout_conv(x)
                
                # Block 3
                x = self.relu(self.bn5(self.conv5(x)))
                x = self.relu(self.bn6(self.conv6(x)))
                x = self.pool(x)
                x = self.dropout_conv(x)
                
                # Classifier
                x = x.view(-1, 256 * 4 * 4)
                x = self.relu(self.fc1(x))
                x = self.dropout(x)
                x = self.relu(self.fc2(x))
                x = self.dropout(x)
                x = self.fc3(x)
                
                return x
        
        model = CustomCNN(num_classes).to(self.device)
        
        # Print model info
        total_params = sum(p.numel() for p in model.parameters())
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        
        print(f"📊 Total parameters: {total_params:,}")
        print(f"📊 Trainable parameters: {trainable_params:,}")
        
        # Calculate model size
        param_size = 0
        for param in model.parameters():
            param_size += param.nelement() * param.element_size()
        buffer_size = 0
        for buffer in model.buffers():
            buffer_size += buffer.nelement() * buffer.element_size()
        model_size = (param_size + buffer_size) / 1024 / 1024
        print(f"📊 Model size: {model_size:.2f} MB")
        
        return model
    
    def advanced_training_loop(self, model, train_loader, test_loader, epochs=50):
        """Advanced training with scheduling and monitoring"""
        print("🏃 Starting Advanced Training Loop")
        print("=" * 33)
        
        # Loss and optimizer
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)
        
        # Learning rate scheduler
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
        
        # Training history
        history = {
            'train_loss': [], 'train_acc': [],
            'test_loss': [], 'test_acc': [],
            'lr': []
        }
        
        best_acc = 0.0
        
        for epoch in range(epochs):
            # Training phase
            model.train()
            train_loss = 0.0
            train_correct = 0
            train_total = 0
            
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}')
            
            for batch_idx, (data, target) in enumerate(progress_bar):
                data, target = data.to(self.device), target.to(self.device)
                
                optimizer.zero_grad()
                output = model(data)
                loss = criterion(output, target)
                loss.backward()
                optimizer.step()
                
                # Statistics
                train_loss += loss.item()
                _, predicted = output.max(1)
                train_total += target.size(0)
                train_correct += predicted.eq(target).sum().item()
                
                # Update progress bar
                progress_bar.set_postfix({
                    'Loss': f'{loss.item():.4f}',
                    'Acc': f'{100.*train_correct/train_total:.2f}%'
                })
            
            # Testing phase
            model.eval()
            test_loss = 0.0
            test_correct = 0
            test_total = 0
            
            with torch.no_grad():
                for data, target in test_loader:
                    data, target = data.to(self.device), target.to(self.device)
                    output = model(data)
                    test_loss += criterion(output, target).item()
                    
                    _, predicted = output.max(1)
                    test_total += target.size(0)
                    test_correct += predicted.eq(target).sum().item()
            
            # Calculate metrics
            train_loss /= len(train_loader)
            train_acc = 100. * train_correct / train_total
            test_loss /= len(test_loader)
            test_acc = 100. * test_correct / test_total
            
            # Update learning rate
            scheduler.step()
            current_lr = optimizer.param_groups[0]['lr']
            
            # Store history
            history['train_loss'].append(train_loss)
            history['train_acc'].append(train_acc)
            history['test_loss'].append(test_loss)
            history['test_acc'].append(test_acc)
            history['lr'].append(current_lr)
            
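            # Save best model (selecting on the test set gives an optimistic estimate;
            # a held-out validation split is the safer choice for real experiments)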
            if test_acc > best_acc:
                best_acc = test_acc
                torch.save(model.state_dict(), 'best_model.pth')
            
            # Print epoch results
            print(f'Epoch: {epoch+1:3d} | '
                  f'Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | '
                  f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}% | '
                  f'LR: {current_lr:.6f}')
        
        print(f"\n✅ Training completed! Best accuracy: {best_acc:.2f}%")
        
        # Plot training history
        self.plot_training_history(history)
        
        return model, history
    
    def plot_training_history(self, history):
        """Plot comprehensive training history"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Loss plot
        axes[0, 0].plot(history['train_loss'], label='Training Loss', color='blue')
        axes[0, 0].plot(history['test_loss'], label='Test Loss', color='red')
        axes[0, 0].set_xlabel('Epoch')
        axes[0, 0].set_ylabel('Loss')
        axes[0, 0].set_title('Training and Test Loss')
        axes[0, 0].legend()
        axes[0, 0].grid(True)
        
        # Accuracy plot
        axes[0, 1].plot(history['train_acc'], label='Training Accuracy', color='blue')
        axes[0, 1].plot(history['test_acc'], label='Test Accuracy', color='red')
        axes[0, 1].set_xlabel('Epoch')
        axes[0, 1].set_ylabel('Accuracy (%)')
        axes[0, 1].set_title('Training and Test Accuracy')
        axes[0, 1].legend()
        axes[0, 1].grid(True)
        
        # Learning rate plot
        axes[1, 0].plot(history['lr'], color='green')
        axes[1, 0].set_xlabel('Epoch')
        axes[1, 0].set_ylabel('Learning Rate')
        axes[1, 0].set_title('Learning Rate Schedule')
        axes[1, 0].grid(True)
        
        # Overfitting analysis
        gap = np.array(history['train_acc']) - np.array(history['test_acc'])
        axes[1, 1].plot(gap, color='purple')
        axes[1, 1].set_xlabel('Epoch')
        axes[1, 1].set_ylabel('Accuracy Gap (%)')
        axes[1, 1].set_title('Overfitting Analysis (Train - Test Acc)')
        axes[1, 1].grid(True)
        
        plt.tight_layout()
        plt.show()

# Create instance
intermediate_trainer = IntermediateTraining()
print("🚀 Intermediate Training module initialized!")
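
`advanced_training_loop` uses `CosineAnnealingLR` with `T_max=epochs`. As a sketch of what that schedule does, the function below evaluates the closed-form cosine annealing rule in plain Python, assuming the loop's base rate of 0.001 and a minimum rate of 0; the function name is made up for this illustration.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.001, t_max=50, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)

schedule = [cosine_annealing_lr(e) for e in range(51)]
for e in (0, 10, 25, 40, 50):
    print(f"epoch {e:2d}: lr = {schedule[e]:.6f}")
```

Because the loop records the rate after calling `scheduler.step()`, the first value stored in `history['lr']` corresponds to epoch 1 of this curve rather than epoch 0.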

In [None]:
# 🎯 RUN INTERMEDIATE TRAINING DEMO
print("🚀 RUNNING INTERMEDIATE TRAINING DEMONSTRATION")
print("=" * 46)

# Create dataset
train_loader, test_loader, classes = intermediate_trainer.create_real_world_dataset()

# Create custom CNN
cnn_model = intermediate_trainer.create_custom_cnn(num_classes=10)

# Train with advanced techniques (reduce epochs for demo)
print("\n🏃 Starting advanced training (using fewer epochs for demonstration)...")
trained_cnn, cnn_history = intermediate_trainer.advanced_training_loop(
    cnn_model, train_loader, test_loader, epochs=10
)

print("\n🎉 Intermediate training demonstration completed!")
print("💡 You've successfully trained a CNN with advanced techniques!")
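
Model ensembling appears in the intermediate topic list. A minimal sketch of the most common variant, averaging the class probabilities of several models and taking the arg-max, is shown below in plain Python. The three probability vectors are made-up numbers standing in for the softmax outputs of three hypothetical models on one image.

```python
def average_ensemble(prob_lists):
    """Average per-class probabilities across models."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

model_probs = [
    [0.50, 0.45, 0.05],  # model A narrowly prefers class 0
    [0.30, 0.60, 0.10],  # model B prefers class 1
    [0.35, 0.55, 0.10],  # model C prefers class 1
]

ensemble = average_ensemble(model_probs)
prediction = argmax(ensemble)
print(ensemble, prediction)
```

Averaging lets the two models that agree outweigh the single dissenting one, and the averaged vector is still a valid probability distribution.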

# ⚡ Level 3: Advanced - Multi-GPU, Mixed Precision & Custom Components

Now we enter the advanced realm with multi-GPU training, mixed precision, custom loss functions, and model optimization techniques.
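
Mixed precision runs much of the forward and backward pass in 16-bit floats, whose smallest positive value is about 6e-8. Gradients smaller than that flush to zero, which is why `GradScaler` multiplies the loss by a large factor before `backward()` and divides the gradients back before the optimizer step. The plain-Python sketch below imitates that idea with the standard library's half-precision packing. The gradient value is hypothetical, the helper `to_fp16` is made up for this illustration, and the scale of 2**16 is chosen to match the documented default `init_scale` of `GradScaler`.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8      # hypothetical tiny gradient value
scale = 65536.0  # 2**16, an illustrative loss-scale factor

unscaled = to_fp16(grad)        # too small for FP16: underflows to zero
scaled = to_fp16(grad * scale)  # representable once scaled up
recovered = scaled / scale      # unscale in higher precision

print(f"stored directly : {unscaled}")
print(f"stored scaled   : {scaled}")
print(f"after unscaling : {recovered}")
```

The recovered value differs from the original only by FP16 rounding error, whereas the unscaled gradient is lost entirely.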

In [None]:
class AdvancedTraining:
    """
    Level 3: Advanced training with multi-GPU, mixed precision, and custom components
    """
    
    def __init__(self):
        self.device = device
        self.scaler = GradScaler()  # For mixed precision
        
    def custom_loss_functions(self):
        """Demonstrate custom loss function implementations"""
        print("🔧 Custom Loss Functions")
        print("=" * 24)
        
        class FocalLoss(nn.Module):
            """Focal Loss for addressing class imbalance"""
            def __init__(self, alpha=1, gamma=2, reduction='mean'):
                super(FocalLoss, self).__init__()
                self.alpha = alpha
                self.gamma = gamma
                self.reduction = reduction
                
            def forward(self, inputs, targets):
                ce_loss = F.cross_entropy(inputs, targets, reduction='none')
                pt = torch.exp(-ce_loss)
                focal_loss = self.alpha * (1-pt)**self.gamma * ce_loss
                
                if self.reduction == 'mean':
                    return focal_loss.mean()
                elif self.reduction == 'sum':
                    return focal_loss.sum()
                else:
                    return focal_loss
        
        class LabelSmoothingLoss(nn.Module):
            """Label Smoothing for better generalization"""
            def __init__(self, num_classes, smoothing=0.1):
                super(LabelSmoothingLoss, self).__init__()
                self.num_classes = num_classes
                self.smoothing = smoothing
                
            def forward(self, inputs, targets):
                log_probs = F.log_softmax(inputs, dim=1)
                targets_one_hot = torch.zeros_like(log_probs)
                targets_one_hot.fill_(self.smoothing / (self.num_classes - 1))
                targets_one_hot.scatter_(1, targets.unsqueeze(1), 1.0 - self.smoothing)
                
                loss = (-targets_one_hot * log_probs).sum(dim=1).mean()
                return loss
        
        class CutMixLoss(nn.Module):
            """CutMix loss for data augmentation"""
            def __init__(self, criterion):
                super(CutMixLoss, self).__init__()
                self.criterion = criterion
                
            def forward(self, pred, y_a, y_b, lam):
                return lam * self.criterion(pred, y_a) + (1 - lam) * self.criterion(pred, y_b)
        
        print("✅ Custom loss functions defined:")
        print("   - Focal Loss: Handles class imbalance")
        print("   - Label Smoothing: Improves generalization")
        print("   - CutMix Loss: Advanced data augmentation")
        
        return FocalLoss, LabelSmoothingLoss, CutMixLoss
    
    def mixed_precision_training(self, model, train_loader, test_loader, epochs=20):
        """Training with Automatic Mixed Precision (AMP)"""
        print("🚀 Mixed Precision Training")
        print("=" * 26)
        
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)
        scheduler = optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=epochs
        )

        history = {'train_loss': [], 'train_acc': [], 'test_loss': [], 'test_acc': []}
        best_acc = 0.0

        print(f"🔥 Training with Mixed Precision (FP16)...")

        for epoch in range(epochs):
            # Training phase
            model.train()
            train_loss = 0.0
            train_correct = 0
            train_total = 0

            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs} (AMP)')

            for batch_idx, (data, target) in enumerate(progress_bar):
                data, target = data.to(self.device), target.to(self.device)

                optimizer.zero_grad()

                # Mixed precision forward pass
                with autocast(device_type='cuda' if torch.cuda.is_available() else 'cpu'):
                    output = model(data)
                    loss = criterion(output, target)

                # Mixed precision backward pass
                self.scaler.scale(loss).backward()
                self.scaler.step(optimizer)
                self.scaler.update()
                scheduler.step()

                # Statistics
                train_loss += loss.item()
                _, predicted = output.max(1)
                train_total += target.size(0)
                train_correct += predicted.eq(target).sum().item()

                progress_bar.set_postfix({
                    'Loss': f'{loss.item():.4f}',
                    'Acc': f'{100.*train_correct/train_total:.2f}%',
                    'LR': f'{scheduler.get_last_lr()[0]:.6f}'
                })

            # Testing phase
            model.eval()
            test_loss = 0.0
            test_correct = 0
            test_total = 0

            with torch.no_grad():
                for data, target in test_loader:
                    data, target = data.to(self.device), target.to(self.device)

                    with autocast(device_type='cuda' if torch.cuda.is_available() else 'cpu'):
                        output = model(data)
                        test_loss += criterion(output, target).item()

                    _, predicted = output.max(1)
                    test_total += target.size(0)
                    test_correct += predicted.eq(target).sum().item()

            # Calculate metrics
            train_loss /= len(train_loader)
            train_acc = 100. * train_correct / train_total
            test_loss /= len(test_loader)
            test_acc = 100. * test_correct / test_total

            # Store history
            history['train_loss'].append(train_loss)
            history['train_acc'].append(train_acc)
            history['test_loss'].append(test_loss)
            history['test_acc'].append(test_acc)

            if test_acc > best_acc:
                best_acc = test_acc
                torch.save(model.state_dict(), 'best_model_amp.pth')

            print(f'Epoch: {epoch+1:3d} | '
                  f'Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | '
                  f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}%')

        print(f"\n✅ Mixed Precision Training completed! Best accuracy: {best_acc:.2f}%")
        print(f"💡 Mixed precision can provide 1.5-2x speedup with minimal accuracy loss")

        return model, history

    def model_ensembling(self, models, test_loader):
        """Demonstrate model ensembling techniques"""
        print("🎭 Model Ensembling")
        print("=" * 17)

        def evaluate_ensemble(models, dataloader, ensemble_method='average'):
            """Evaluate ensemble of models"""
            all_models_eval = [model.eval() for model in models]
            correct = 0
            total = 0

            with torch.no_grad():
                for data, target in dataloader:
                    data, target = data.to(self.device), target.to(self.device)

                    # Get predictions from all models
                    predictions = []
                    for model in models:
                        with autocast(device_type='cuda' if torch.cuda.is_available() else 'cpu'):
                            output = model(data)
                            predictions.append(F.softmax(output, dim=1))

                    # Ensemble predictions
                    if ensemble_method == 'average':
                        ensemble_pred = torch.stack(predictions).mean(dim=0)
                    elif ensemble_method == 'max':
                        ensemble_pred = torch.stack(predictions).max(dim=0)[0]
                    elif ensemble_method == 'weighted':
                        # Simple equal weighting - could be learned
                        weights = torch.ones(len(models)) / len(models)
                        ensemble_pred = sum(w * pred for w, pred in zip(weights, predictions))

                    _, predicted = ensemble_pred.max(1)
                    total += target.size(0)
                    correct += predicted.eq(target).sum().item()

            return 100. * correct / total

        # Test different ensemble methods
        methods = ['average', 'max', 'weighted']
        results = {}

        for method in methods:
            acc = evaluate_ensemble(models, test_loader, method)
            results[method] = acc
            print(f"📊 {method.capitalize()} Ensemble Accuracy: {acc:.2f}%")

        # Compare with individual models
        print("\n📊 Individual Model Performance:")
        for i, model in enumerate(models):
            individual_acc = evaluate_ensemble([model], test_loader)
            print(f"   Model {i+1}: {individual_acc:.2f}%")

        print(f"\n✅ Best ensemble method: {max(results, key=results.get)} ({max(results.values()):.2f}%)")

        return results

    def model_compression_pruning(self, model):
        """Demonstrate model compression through pruning"""
        print("✂️  Model Compression & Pruning")
        print("=" * 29)

        import torch.nn.utils.prune as prune

        # Calculate original model size
        original_size = sum(p.numel() for p in model.parameters())

        # Global magnitude pruning
        parameters_to_prune = []
        for name, module in model.named_modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                parameters_to_prune.append((module, 'weight'))

        # Prune 20% of connections globally
        prune.global_unstructured(
            parameters_to_prune,
            pruning_method=prune.L1Unstructured,
            amount=0.2,
        )

        # Count surviving weights. Pruning only zeroes entries through a mask
        # (tensor shapes are unchanged), so subtract the zeroed weights
        # instead of re-counting numel(), which would report no reduction.
        zeroed = sum(int((module.weight == 0).sum()) for module, _ in parameters_to_prune)
        compressed_size = original_size - zeroed

        print(f"📊 Original model parameters: {original_size:,}")
        print(f"📊 Non-zero parameters after pruning: {compressed_size:,}")
        print(f"📊 Compression ratio: {original_size/compressed_size:.2f}x")
        print(f"📊 Parameter reduction: {(1-compressed_size/original_size)*100:.1f}%")

        # Remove pruning masks to make pruning permanent
        for module, param_name in parameters_to_prune:
            prune.remove(module, param_name)

        print("✅ Model pruning completed!")
        print("💡 Pruned models typically need fine-tuning to recover performance")

        return model

    def gradient_accumulation_training(self, model, train_loader, test_loader,
                                      accumulation_steps=4, epochs=10):
        """Training with gradient accumulation for large effective batch sizes"""
        print(f"🔄 Gradient Accumulation Training")
        print(f"   Effective batch size: {train_loader.batch_size * accumulation_steps}")
        print("=" * 33)

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)

        for epoch in range(epochs):
            model.train()
            running_loss = 0.0

            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs} (GradAccum)')

            for batch_idx, (data, target) in enumerate(progress_bar):
                data, target = data.to(self.device), target.to(self.device)

                with autocast(device_type='cuda' if torch.cuda.is_available() else 'cpu'):
                    output = model(data)
                    loss = criterion(output, target) / accumulation_steps  # Scale loss

                self.scaler.scale(loss).backward()

                if (batch_idx + 1) % accumulation_steps == 0:
                    self.scaler.step(optimizer)
                    self.scaler.update()
                    optimizer.zero_grad()

                running_loss += 
loss.item() * accumulation_steps\n                \n                progress_bar.set_postfix({\n                    'Loss': f'{running_loss/(batch_idx+1):.4f}',\n                    'Step': f'{(batch_idx+1)//accumulation_steps}'\n                })\n            \n            print(f'Epoch {epoch+1} completed - Avg Loss: {running_loss/len(train_loader):.4f}')\n        \n        print(\"✅ Gradient accumulation training completed!\")\n        print(\"💡 Allows training with larger effective batch sizes on limited GPU memory\")\n        \n        return model

# Create instance
advanced_trainer = AdvancedTraining()
print("⚡ Advanced Training module initialized!")
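The gradient accumulation method above divides each micro-batch loss by `accumulation_steps` before calling `backward()`. The sketch below illustrates why, under simplifying assumptions: a tiny `nn.Linear` model, random data, equal-sized micro-batches, and no mixed precision. None of these names come from the notebook's classes; they are for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)          # toy model, assumed for illustration
criterion = nn.CrossEntropyLoss()
data = torch.randn(8, 4)         # one "large" batch of 8 samples
target = torch.randint(0, 2, (8,))

# Reference: gradient of the mean loss over the full batch of 8.
model.zero_grad()
criterion(model(data), target).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2, each loss divided by accumulation_steps.
accumulation_steps = 4
model.zero_grad()
for micro_data, micro_target in zip(data.chunk(accumulation_steps),
                                    target.chunk(accumulation_steps)):
    loss = criterion(model(micro_data), micro_target) / accumulation_steps
    loss.backward()  # gradients add up in .grad across backward() calls
accumulated_grad = model.weight.grad.clone()

# With equal-sized micro-batches the two gradients agree up to float error.
grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
print(grads_match)
```

Without the division, the accumulated gradient would be `accumulation_steps` times larger than the full-batch gradient, which acts like an unintended learning-rate increase.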

# 🔬 Level 4: Super Advanced - Research-Grade Techniques

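Welcome to the cutting edge! This section walks through simplified versions of techniques used in research and large-scale production: random-search NAS, MAML-style meta-learning, SimCLR-style contrastive learning, a distributed training setup, and custom optimizers (LookAhead, SAM).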

In [None]:
class SuperAdvancedTraining:
    """
    Level 4: Super Advanced - Research-grade techniques
    Including NAS, meta-learning, self-supervised learning, and distributed training
    """
    
    def __init__(self):
        self.device = device
        
    def neural_architecture_search(self, search_space_size=50, epochs_per_architecture=5):
        """Simple Neural Architecture Search implementation"""
        print("🔍 Neural Architecture Search (NAS)")
        print("=" * 35)
        
        class SearchableConvBlock(nn.Module):
            """Searchable convolutional block with different options"""
            def __init__(self, in_channels, out_channels):
                super().__init__()
                # Different kernel sizes to search over
                self.conv3x3 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
                self.conv5x5 = nn.Conv2d(in_channels, out_channels, 5, padding=2)
                self.conv1x1 = nn.Conv2d(in_channels, out_channels, 1)
                self.bn = nn.BatchNorm2d(out_channels)
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x, arch_choice):
                if arch_choice == 0:
                    x = self.conv3x3(x)
                elif arch_choice == 1:
                    x = self.conv5x5(x)
                else:
                    x = self.conv1x1(x)
                return self.relu(self.bn(x))

        class NASModel(nn.Module):
            """Model with searchable architecture"""
            def __init__(self, num_classes=10):
                super().__init__()
                self.layer1 = SearchableConvBlock(3, 64)
                self.layer2 = SearchableConvBlock(64, 128)
                self.layer3 = SearchableConvBlock(128, 256)
                self.pool = nn.AdaptiveAvgPool2d(1)
                self.fc = nn.Linear(256, num_classes)

            def forward(self, x, architecture):
                x = self.layer1(x, architecture[0])
                x = F.max_pool2d(x, 2)
                x = self.layer2(x, architecture[1])
                x = F.max_pool2d(x, 2)
                x = self.layer3(x, architecture[2])
                x = self.pool(x)
                x = x.view(x.size(0), -1)
                x = self.fc(x)
                return x

        def evaluate_architecture(architecture, model, train_loader, test_loader):
            """Evaluate a specific architecture"""
            criterion = nn.CrossEntropyLoss()
            optimizer = optim.Adam(model.parameters(), lr=0.001)

            # Quick training
            model.train()
            for epoch in range(epochs_per_architecture):
                for batch_idx, (data, target) in enumerate(train_loader):
                    if batch_idx > 20:  # Limit batches for speed
                        break
                    data, target = data.to(self.device), target.to(self.device)
                    optimizer.zero_grad()
                    output = model(data, architecture)
                    loss = criterion(output, target)
                    loss.backward()
                    optimizer.step()

            # Quick evaluation
            model.eval()
            correct = 0
            total = 0
            with torch.no_grad():
                for batch_idx, (data, target) in enumerate(test_loader):
                    if batch_idx > 10:  # Limit batches for speed
                        break
                    data, target = data.to(self.device), target.to(self.device)
                    output = model(data, architecture)
                    _, predicted = output.max(1)
                    total += target.size(0)
                    correct += predicted.eq(target).sum().item()

            return 100. * correct / total

        # Create small dataset for NAS demo
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])

        print("📥 Loading small dataset for NAS demo...")
        # download=True is a no-op when the data is already in ./data
        trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
        testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

        # Use smaller subsets for speed
        train_subset = torch.utils.data.Subset(trainset, range(1000))
        test_subset = torch.utils.data.Subset(testset, range(200))

        train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)
        test_loader = DataLoader(test_subset, batch_size=64, shuffle=False)

        # Search over architectures
        best_architecture = None
        best_accuracy = -1.0  # below any real accuracy, so the first candidate always wins
        architecture_results = []

        print(f"🔍 Searching over {search_space_size} architectures...")

        for i in tqdm(range(search_space_size), desc="Architecture Search"):
            # Random architecture: [layer1_choice, layer2_choice, layer3_choice]
            architecture = [random.randint(0, 2) for _ in range(3)]

            # Create and evaluate model
            model = NASModel().to(self.device)
            accuracy = evaluate_architecture(architecture, model, train_loader, test_loader)

            architecture_results.append((architecture, accuracy))

            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_architecture = architecture

        # Results
        print("\n✅ Neural Architecture Search completed!")
        print(f"🏆 Best architecture: {best_architecture}")
        print(f"📊 Best accuracy: {best_accuracy:.2f}%")

        # Analyze results
        arch_choices = ['3x3 Conv', '5x5 Conv', '1x1 Conv']
        print("\n📊 Architecture breakdown:")
        for i, choice in enumerate(best_architecture):
            print(f"   Layer {i+1}: {arch_choices[choice]}")

        return best_architecture, architecture_results

    def meta_learning_few_shot(self, n_way=5, k_shot=1, query_shots=15):
        """Model-Agnostic Meta-Learning (MAML) for few-shot learning"""
        print(f"🧠 Meta-Learning: {n_way}-way {k_shot}-shot Learning")
        print("=" * 45)

        from torch.func import functional_call  # PyTorch 2.0+

        class SimpleMAMLModel(nn.Module):
            """Simple model for MAML demonstration"""
            def __init__(self, input_size=84*84*3, hidden_size=128, output_size=5):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(input_size, hidden_size),
                    nn.ReLU(),
                    nn.Linear(hidden_size, hidden_size),
                    nn.ReLU(),
                    nn.Linear(hidden_size, output_size)
                )

            def forward(self, x):
                return self.net(x.view(x.size(0), -1))

        def create_few_shot_task(dataset, n_way, k_shot, query_shots):
            """Create a few-shot learning task"""
            # Simulate few-shot task creation
            classes = random.sample(range(10), n_way)  # Select n_way classes

            support_data = []
            support_labels = []
            query_data = []
            query_labels = []

            for i, class_idx in enumerate(classes):
                # Simulate k-shot support set
                for _ in range(k_shot):
                    # Create random data (in practice, this would be real data)
                    data = torch.randn(3, 84, 84)
                    support_data.append(data)
                    support_labels.append(i)

                # Create query set
                for _ in range(query_shots):
                    data = torch.randn(3, 84, 84)
                    query_data.append(data)
                    query_labels.append(i)

            support_data = torch.stack(support_data)
            support_labels = torch.tensor(support_labels)
            query_data = torch.stack(query_data)
            query_labels = torch.tensor(query_labels)

            return (support_data, support_labels), (query_data, query_labels)

        def maml_inner_update(model, support_data, support_labels, lr_inner=0.01):
            """Inner loop update for MAML; returns the adapted parameters"""
            criterion = nn.CrossEntropyLoss()

            support_data = support_data.to(self.device)
            support_labels = support_labels.to(self.device)

            # Forward pass with the current meta-parameters
            params = dict(model.named_parameters())
            output = functional_call(model, params, (support_data,))
            loss = criterion(output, support_labels)

            # create_graph=True keeps the graph so the meta-update can
            # backpropagate through this inner step
            grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)

            # Differentiable update: build new tensors instead of writing to
            # param.data, which would cut the graph back to the meta-model
            adapted_params = {
                name: param - lr_inner * grad
                for (name, param), grad in zip(params.items(), grads)
            }

            return adapted_params

        # Initialize meta-model
        meta_model = SimpleMAMLModel(output_size=n_way).to(self.device)
        meta_optimizer = optim.Adam(meta_model.parameters(), lr=0.001)
        criterion = nn.CrossEntropyLoss()

        print("🚀 Starting MAML training...")

        # Meta-training loop
        n_meta_epochs = 20
        tasks_per_epoch = 10

        for epoch in range(n_meta_epochs):
            meta_loss = 0.0

            for task in range(tasks_per_epoch):
                # Sample a task
                (support_data, support_labels), (query_data, query_labels) = create_few_shot_task(
                    None, n_way, k_shot, query_shots
                )

                # Inner loop: adapt to support set
                adapted_params = maml_inner_update(meta_model, support_data, support_labels)

                # Outer loop: evaluate the adapted parameters on the query set
                query_data = query_data.to(self.device)
                query_labels = query_labels.to(self.device)

                query_output = functional_call(meta_model, adapted_params, (query_data,))
                task_loss = criterion(query_output, query_labels)

                meta_loss += task_loss

            # Meta-update
            meta_optimizer.zero_grad()
            meta_loss.backward()
            meta_optimizer.step()

            if (epoch + 1) % 5 == 0:
                print(f"Meta-Epoch {epoch+1}: Meta-Loss = {meta_loss.item()/tasks_per_epoch:.4f}")

        print("✅ Meta-learning training completed!")
        print("💡 Model can now quickly adapt to new tasks with few examples")

        return meta_model

    def self_supervised_learning(self, train_loader, epochs=20):
        """Self-supervised learning with SimCLR-style contrastive learning"""
        print("🔄 Self-Supervised Learning (Contrastive)")
        print("=" * 40)

        class ContrastiveAugmentation:
            """Data augmentation for contrastive learning"""
            def __init__(self):
                self.transform = transforms.Compose([
                    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),
                    transforms.RandomHorizontalFlip(),
                    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
                    transforms.RandomGrayscale(p=0.2),
                    transforms.ToTensor(),
                    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                ])

            def __call__(self, x):
                # Two independent random augmentations of the same image
                return self.transform(x), self.transform(x)

        class SimCLRProjector(nn.Module):
            """Projection head for SimCLR"""
            def __init__(self, input_dim=512, hidden_dim=512, output_dim=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(input_dim, hidden_dim),
                    nn.ReLU(inplace=True),
                    nn.Linear(hidden_dim, output_dim)
                )

            def forward(self, x):
                return self.net(x)

        class ContrastiveLoss(nn.Module):
            """NT-Xent loss for contrastive learning"""
            def __init__(self, temperature=0.07):
                super().__init__()
                self.temperature = temperature

            def forward(self, z1, z2):
                batch_size = z1.size(0)
                device = z1.device  # the loss module itself has no .device attribute

                # Normalize embeddings
                z1 = F.normalize(z1, dim=1)
                z2 = F.normalize(z2, dim=1)

                # Concatenate positive pairs
                representations = torch.cat([z1, z2], dim=0)

                # Compute similarity matrix
                similarity_matrix = torch.matmul(representations, representations.T)

                # Row i's positive is row i + batch_size, and vice versa
                labels = torch.cat([torch.arange(batch_size, device=device) + batch_size,
                                    torch.arange(batch_size, device=device)], dim=0)

                # Remove self-similarity
                mask = torch.eye(2 * batch_size, dtype=torch.bool, device=device)
                similarity_matrix = similarity_matrix.masked_fill(mask, -9e15)

                # Apply temperature
                similarity_matrix = similarity_matrix / self.temperature

                # Compute cross-entropy loss
                loss = F.cross_entropy(similarity_matrix, labels)

                return loss

        # Create encoder (ResNet backbone)
        from torchvision.models import resnet18
        encoder = resnet18(weights=None)  # random init; replaces deprecated pretrained=False
        encoder.fc = nn.Identity()  # Remove classification head
        encoder = encoder.to(self.device)

        # Add projection head
        projector = SimCLRProjector(512, 512, 128).to(self.device)

        # Combine encoder and projector
        model = nn.Sequential(encoder, projector)

        optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-6)
        criterion = ContrastiveLoss().to(self.device)

        print("🚀 Starting self-supervised training...")

        # Create contrastive dataset
        augmentation = ContrastiveAugmentation()

        for epoch in range(epochs):
            model.train()
            total_loss = 0.0
            num_batches = 0

            progress_bar = tqdm(train_loader, desc=f'SSL Epoch {epoch+1}/{epochs}')

            for batch_idx, (data, _) in enumerate(progress_bar):
                if batch_idx > 50:  # Limit for demo
                    break

                # Create augmented pairs
                data1_list = []
                data2_list = []

                for img in data:
                    # NOTE: ToPILImage expects float tensors in [0, 1]; if the
                    # loader already normalizes its images, undo that first
                    img_pil = transforms.ToPILImage()(img)
                    aug1, aug2 = augmentation(img_pil)
                    data1_list.append(aug1)
                    data2_list.append(aug2)

                data1 = torch.stack(data1_list).to(self.device)
                data2 = torch.stack(data2_list).to(self.device)

                optimizer.zero_grad()

                # Forward pass
                z1 = model(data1)
                z2 = model(data2)

                # Contrastive loss
                loss = criterion(z1, z2)

                loss.backward()
                optimizer.step()

                total_loss += loss.item()
                num_batches += 1

                progress_bar.set_postfix({'Loss': f'{loss.item():.4f}'})

            # Average over the batches actually processed
            avg_loss = total_loss / max(num_batches, 1)
            print(f'Epoch {epoch+1}: Average Loss = {avg_loss:.4f}')

        print("✅ Self-supervised learning completed!")
        print("💡 Learned representations can be fine-tuned for downstream tasks")

        return model[0]  # Return encoder without projection head

    def distributed_training_setup(self):
        """Setup for distributed training across multiple GPUs/nodes"""
        print("🌐 Distributed Training Setup")
        print("=" * 28)

        if not torch.cuda.is_available():
            print("❌ CUDA not available - distributed training requires GPUs")
            return None

        if torch.cuda.device_count() < 2:
            print("⚠️  Only one GPU detected - simulating distributed setup")
            print("💡 For real distributed training, use multiple GPUs or nodes")

        # Distributed training simulation
        class DistributedTrainer:
            def __init__(self, model, rank=0, world_size=1):
                self.rank = rank
                self.world_size = world_size
                self.device = torch.device(f'cuda:{rank}' if torch.cuda.is_available() else 'cpu')

                # Single-process stand-in: real multi-process training would wrap
                # the model in DistributedDataParallel (DDP) instead
                self.model = model.to(self.device)
                if world_size > 1 and torch.cuda.is_available():
                    self.model = nn.DataParallel(self.model)

            def train_step(self, data, target, optimizer, criterion):
                """Single training step with gradient synchronization"""
                data, target = data.to(self.device), target.to(self.device)

                optimizer.zero_grad()
                output = self.model(data)
                loss = criterion(output, target)
                loss.backward()

                # In real DDP, gradients are synchronized across workers during backward()
                optimizer.step()

                return loss.item()

            def save_checkpoint(self, epoch, optimizer, loss, filename):
                """Save training checkpoint"""
                checkpoint = {
                    'epoch': epoch,
                    'model_state_dict': self.model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'loss': loss,
                    'rank': self.rank
                }
                torch.save(checkpoint, f'{filename}_rank{self.rank}.pth')

        print("✅ Distributed training setup completed!")
        print("📊 Key components:")
        print("   - Data Parallel / Distributed Data Parallel")
        print("   - Gradient synchronization across workers")
        print("   - Checkpoint saving and loading")
        print("   - Learning rate scaling for large batch sizes")

        return DistributedTrainer

    def advanced_optimization_techniques(self):
        """Demonstrate advanced optimization techniques"""
        print("🔧 Advanced Optimization Techniques")
        print("=" * 35)

        class LookAhead(optim.Optimizer):
            """LookAhead optimizer wrapper"""
            def __init__(self, base_optimizer, alpha=0.5, k=5):
                # Optimizer.__init__ is intentionally not called: this wrapper
                # shares param_groups and state with base_optimizer
                self.base_optimizer = base_optimizer
                self.alpha = alpha
                self.k = k
                self.param_groups = base_optimizer.param_groups
                self.state = base_optimizer.state
                self.defaults = base_optimizer.defaults
                self.step_count = 0
                self.slow_weights = {}

                for group in self.param_groups:
                    for p in group['params']:
                        self.slow_weights[p] = p.data.clone()

            def zero_grad(self, set_to_none=True):
                self.base_optimizer.zero_grad(set_to_none=set_to_none)

            def step(self, closure=None):
                loss = self.base_optimizer.step(closure)
                self.step_count += 1

                # Update slow weights every k steps
                if self.step_count % self.k == 0:
                    for group in self.param_groups:
                        for p in group['params']:
                            slow_weight = self.slow_weights[p]
                            slow_weight.data += self.alpha * (p.data - slow_weight.data)
                            p.data = slow_weight.data.clone()

                return loss

        class SAM(optim.Optimizer):
            """Sharpness-Aware Minimization optimizer"""
            def __init__(self, params, base_optimizer, rho=0.05, **kwargs):
                defaults = dict(rho=rho, **kwargs)
                super().__init__(params, defaults)  # sets up param_groups and state
                self.rho = rho
                self.base_optimizer = base_optimizer(self.param_groups, **kwargs)
                self.param_groups = self.base_optimizer.param_groups

            @torch.no_grad()
            def first_step(self, zero_grad=False):
                grad_norm = self._grad_norm()
                for group in self.param_groups:
                    scale = self.rho / (grad_norm + 1e-12)
                    for p in group['params']:
                        if p.grad is None:
                            continue
                        e_w = p.grad * scale.to(p)
                        p.add_(e_w)  # Climb to the local maximum
                        self.state[p]['e_w'] = e_w

                if zero_grad:
                    self.zero_grad()

            @torch.no_grad()
            def second_step(self, zero_grad=False):
                for group in self.param_groups:
                    for p in group['params']:
                        if p.grad is None:
                            continue
                        p.sub_(self.state[p]['e_w'])  # Get back to the original point

                self.base_optimizer.step()

                if zero_grad:
                    self.zero_grad()

            def _grad_norm(self):
                shared_device = self.param_groups[0]['params'][0].device
                norm = torch.norm(
                    torch.stack([
                        p.grad.norm(dtype=torch.float32).to(shared_device)
                        for group in self.param_groups for p in group['params']
                        if p.grad is not None
                    ]),
                    dtype=torch.float32
                )
                return norm

        # Demonstrate custom learning rate scheduler
        class WarmupCosineScheduler:
            """Warmup + Cosine Annealing scheduler"""
            def __init__(self, optimizer, warmup_epochs, total_epochs, base_lr, max_lr):
                self.optimizer = optimizer
                self.warmup_epochs = warmup_epochs
                self.total_epochs = total_epochs
                self.base_lr = base_lr
                self.max_lr = max_lr
                self.current_epoch = 0

            def step(self):
                if self.current_epoch < self.warmup_epochs:
                    # Linear warmup
                    lr = self.base_lr + (self.max_lr - self.base_lr) * self.current_epoch / self.warmup_epochs
                else:
                    # Cosine annealing; progress is clamped to [0, 1] so extra
                    # step() calls past total_epochs stay at base_lr
                    decay_epochs = max(1, self.total_epochs - self.warmup_epochs)
                    progress = min(1.0, (self.current_epoch - self.warmup_epochs) / decay_epochs)
                    lr = self.base_lr + (self.max_lr - self.base_lr) * 0.5 * (1 + np.cos(np.pi * progress))

                for param_group in self.optimizer.param_groups:
                    param_group['lr'] = lr

                self.current_epoch += 1
                return lr

        print("✅ Advanced optimization techniques implemented:")
        print("   🔍 LookAhead: Improves convergence stability")
        print("   📈 SAM: Finds flatter minima for better generalization")
        print("   🌡️  Warmup + Cosine: Better learning rate scheduling")
        print("   ⚡ AdamW + Weight Decay: Improved regularization")

        return LookAhead, SAM, WarmupCosineScheduler

# Create instance
super_advanced_trainer = SuperAdvancedTraining()
print("🔬 Super Advanced Training module initialized!")
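The NAS demo above samples each layer's choice at random, so the same architecture can be drawn more than once. With three layers and three kernel options per layer, the search space has only 3³ = 27 architectures, and a `search_space_size` of 50 necessarily repeats some of them. The sketch below enumerates the space and samples without repetition. The helper name `sample_unique_architectures` is hypothetical and is not part of the notebook's classes.

```python
import itertools
import random

NUM_LAYERS = 3
CHOICES = (0, 1, 2)  # 0: 3x3 conv, 1: 5x5 conv, 2: 1x1 conv (same coding as the demo)

# Every possible assignment of one choice per layer.
all_architectures = list(itertools.product(CHOICES, repeat=NUM_LAYERS))

def sample_unique_architectures(n_samples, seed=0):
    """Sample architectures without repetition (hypothetical helper)."""
    rng = random.Random(seed)
    n_samples = min(n_samples, len(all_architectures))  # cannot exceed the space
    return rng.sample(all_architectures, n_samples)

candidates = sample_unique_architectures(50)
print(len(all_architectures), len(candidates))
```

For a space this small, exhaustive evaluation is cheaper than random search with duplicates; random sampling starts to pay off only when the space is far larger than the evaluation budget.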

In [None]:
# 🎯 COMPREHENSIVE DEMONSTRATION: All Levels
print("🚀 RUNNING COMPREHENSIVE MODEL TRAINING DEMONSTRATIONS")
print("=" * 60)

# Level 1: Easy Training Demo
print("\n" + "="*50)
print("🌱 LEVEL 1: EASY TRAINING")
print("="*50)
easy_trainer.create_simple_dataset()

# Level 2: Intermediate Training (Quick Demo)
print("\n" + "="*50)
print("🚀 LEVEL 2: INTERMEDIATE TRAINING")
print("="*50)
# Create smaller dataset for quick demo
transform_quick = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

print("📥 Loading CIFAR-10 for quick intermediate demo...")
trainset_quick = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_quick)
testset_quick = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_quick)

# Use smaller subsets
train_subset = torch.utils.data.Subset(trainset_quick, range(2000))
test_subset = torch.utils.data.Subset(testset_quick, range(500))

quick_train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)
quick_test_loader = DataLoader(test_subset, batch_size=64, shuffle=False)

# Create quick CNN (not trained in this quick demo)
print("🏗️  Creating quick CNN for demo...")
quick_cnn = intermediate_trainer.create_custom_cnn(num_classes=10)

# Level 3: Advanced Training Demo
print("\n" + "="*50)
print("⚡ LEVEL 3: ADVANCED TRAINING")
print("="*50)

# Demonstrate custom loss functions
print("\n🔧 Custom Loss Functions Demo:")
FocalLoss, LabelSmoothingLoss, CutMixLoss = advanced_trainer.custom_loss_functions()

# Quick mixed precision demo
print("\n🚀 Mixed Precision Training Demo (Quick):")
quick_model_copy = intermediate_trainer.create_custom_cnn(num_classes=10)
advanced_trainer.mixed_precision_training(quick_model_copy, quick_train_loader, quick_test_loader, epochs=3)

# Model compression demo
print("\n✂️  Model Compression Demo:")
compression_model = intermediate_trainer.create_custom_cnn(num_classes=10)
compressed_model = advanced_trainer.model_compression_pruning(compression_model)

# Level 4: Super Advanced Training Demo
print("\n" + "="*50)
print("🔬 LEVEL 4: SUPER ADVANCED TRAINING")
print("="*50)

# Neural Architecture Search Demo (Quick)
print("\n🔍 Neural Architecture Search Demo:")
best_arch, arch_results = super_advanced_trainer.neural_architecture_search(
    search_space_size=10, epochs_per_architecture=2
)

# Meta-Learning Demo (Quick)
print("\n🧠 Meta-Learning Demo:")
meta_model = super_advanced_trainer.meta_learning_few_shot(n_way=3, k_shot=1, query_shots=5)

# Self-Supervised Learning Demo (Quick)
print("\n🔄 Self-Supervised Learning Demo:")
ssl_encoder = super_advanced_trainer.self_supervised_learning(quick_train_loader, epochs=3)

# Distributed Training Setup
print("\n🌐 Distributed Training Setup:")
DistributedTrainer = super_advanced_trainer.distributed_training_setup()

# Advanced Optimization
print("\n🔧 Advanced Optimization Techniques:")
LookAhead, SAM, WarmupCosineScheduler = super_advanced_trainer.advanced_optimization_techniques()

print("\n" + "="*60)
print("🎉 ALL DEMONSTRATIONS COMPLETED!")
print("="*60)
print("\n📚 Summary of what you've learned:")
print("🌱 Easy: Basic neural networks and training loops")
print("🚀 Intermediate: CNNs, real datasets, advanced training")
print("⚡ Advanced: Multi-GPU, mixed precision, custom components")
print("🔬 Super Advanced: NAS, meta-learning, self-supervised learning")
print("\n💡 You now have the complete toolkit for state-of-the-art model training!")
print("🚀 Ready to tackle any computer vision challenge!")
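The demo returns `WarmupCosineScheduler` without showing the learning rates it produces. As a rough illustration, the sketch below applies the same warmup and cosine formula as a standalone function, with no optimizer attached. The function name `warmup_cosine_lr` and the epoch and learning-rate values are assumptions chosen for readability, not settings from the notebook.

```python
import math

def warmup_cosine_lr(epoch, warmup_epochs, total_epochs, base_lr, max_lr):
    """Learning rate at a given epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp from base_lr toward max_lr
        return base_lr + (max_lr - base_lr) * epoch / warmup_epochs
    # Fraction of the decay phase completed, clamped to [0, 1]
    decay_epochs = max(1, total_epochs - warmup_epochs)
    progress = min(1.0, (epoch - warmup_epochs) / decay_epochs)
    # Cosine curve from max_lr (progress 0) down to base_lr (progress 1)
    return base_lr + (max_lr - base_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# Illustrative settings: 2 warmup epochs, 6 epochs total
schedule = [
    warmup_cosine_lr(e, warmup_epochs=2, total_epochs=6, base_lr=0.0, max_lr=0.1)
    for e in range(7)
]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.4f}")
```

The rate peaks at `max_lr` exactly when warmup ends and returns to `base_lr` once `total_epochs` is reached, which is why `total_epochs` should match the real training length.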