# Advanced Neural Network Architectures: From Skip Connections to Modern Designs

**PyTorch Mastery Hub - Advanced Architectures Module**

**Topics Covered:** ResNet, DenseNet, Highway Networks, Attention Mechanisms, Custom Architecture Design  
**Prerequisites:** Understanding of MLPs, backpropagation, and gradient flow  
**Difficulty Level:** Advanced  
**Estimated Time:** 2-3 hours

## Overview

This notebook explores the architectural innovations that made very deep networks trainable. It starts with the vanishing gradient problem, then implements ResNet, DenseNet, Highway Networks, and attention modules (channel, spatial, CBAM) from scratch.

## Key Objectives
1. Master skip connections and residual learning principles
2. Implement ResNet architectures from scratch (Basic and Bottleneck blocks)
3. Build DenseNet with dense connections and feature reuse
4. Create Highway Networks with learnable gating mechanisms
5. Implement modern attention mechanisms (Channel, Spatial, CBAM)
6. Design custom modular architectures for specific tasks
7. Compare architectural choices and analyze trade-offs
8. Generate comprehensive performance analysis and visualizations

---

## 1. Setup and Environment Configuration

```python
# Essential imports for advanced architectures
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional, Union
import time
import math
from collections import OrderedDict
import warnings
warnings.filterwarnings('ignore')

# Import our utilities
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..', '..'))

try:
    from src.utils.device_utils import get_device
    from src.utils.model_utils import count_parameters, get_model_size
    from src.utils.logging_utils import setup_logger
except ImportError:
    print("Warning: Custom utilities not found. Using fallback implementations.")
    def get_device():
        return torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters())
    
    def setup_logger(name):
        import logging
        return logging.getLogger(name)

# Set up environment
device = get_device()
torch.manual_seed(42)
np.random.seed(42)
logger = setup_logger('Advanced_Architectures')

# Configure plotting
plt.style.use('default')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
plt.rcParams['figure.facecolor'] = 'white'
plt.rcParams['axes.facecolor'] = 'white'

# Create results directory
results_dir = os.path.join('results', 'advanced_architectures')
os.makedirs(results_dir, exist_ok=True)

print("🏗️ PyTorch Mastery Hub - Advanced Neural Architectures")
print("=" * 60)
print(f"📱 Device: {device}")
print(f"🎨 PyTorch version: {torch.__version__}")
print(f"📁 Results directory: {results_dir}")
print(f"✨ Ready to build cutting-edge architectures!\n")
```

---

## 2. Understanding the Vanishing Gradient Problem

### 2.1 Theoretical Foundation and Demonstration

```python
print("=== 2.1 Vanishing Gradient Problem Analysis ===\n")

def create_deep_network(num_layers: int, use_skip: bool = False) -> nn.Module:
    """Create a deep network with or without skip connections for gradient flow analysis."""
    
    layers = []
    input_size = 128
    hidden_size = 64
    
    # First layer
    layers.append(nn.Linear(input_size, hidden_size))
    
    # Hidden layers
    for i in range(num_layers - 2):
        if use_skip and i > 0 and i % 2 == 0:  # Use a residual layer for every second hidden layer (i = 2, 4, ...)
            layers.append(SkipConnection(hidden_size))
        else:
            layers.append(nn.Linear(hidden_size, hidden_size))
        layers.append(nn.ReLU())
    
    # Output layer
    layers.append(nn.Linear(hidden_size, 10))
    
    return nn.Sequential(*layers)

class SkipConnection(nn.Module):
    """Simple skip connection implementation for demonstration."""
    
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
    
    def forward(self, x):
        return x + self.linear(x)  # Skip connection: f(x) + x

def analyze_gradient_flow(model: nn.Module, input_size: Tuple[int, ...]):
    """Analyze gradient flow through the network layers."""
    
    # Create dummy input and target
    x = torch.randn(*input_size, requires_grad=True)
    target = torch.randint(0, 10, (input_size[0],))
    
    # Forward pass
    output = model(x)
    loss = F.cross_entropy(output, target)
    
    # Backward pass
    loss.backward()
    
    # Collect gradient norms
    gradient_norms = []
    layer_names = []
    
    for name, param in model.named_parameters():
        if param.grad is not None and 'weight' in name:
            gradient_norms.append(param.grad.norm().item())
            layer_names.append(name.replace('.weight', ''))
    
    return gradient_norms, layer_names

# Test different network depths
print("🔍 Analyzing gradient flow in networks of varying depths:")

depths = [5, 10, 20, 30]
gradient_data = {'depth': [], 'layer': [], 'gradient_norm': [], 'has_skip': []}

for depth in depths:
    print(f"\n📊 Testing {depth}-layer networks:")
    
    # Without skip connections
    model_no_skip = create_deep_network(depth, use_skip=False)
    grads_no_skip, layer_names = analyze_gradient_flow(model_no_skip, (32, 128))
    
    avg_grad_no_skip = np.mean(grads_no_skip)
    print(f"   Without skip: avg gradient norm = {avg_grad_no_skip:.2e}")
    
    # With skip connections (for deeper networks)
    if depth >= 10:
        model_skip = create_deep_network(depth, use_skip=True)
        grads_skip, _ = analyze_gradient_flow(model_skip, (32, 128))
        avg_grad_skip = np.mean(grads_skip)
        print(f"   With skip:    avg gradient norm = {avg_grad_skip:.2e}")
        improvement_ratio = avg_grad_skip / avg_grad_no_skip if avg_grad_no_skip > 0 else 1
        print(f"   Improvement:  {improvement_ratio:.2f}x better gradient flow")
        
        # Store data for visualization
        for i, (grad_no_skip, grad_skip) in enumerate(zip(grads_no_skip, grads_skip[:len(grads_no_skip)])):
            gradient_data['depth'].extend([depth, depth])
            gradient_data['layer'].extend([i, i])
            gradient_data['gradient_norm'].extend([grad_no_skip, grad_skip])
            gradient_data['has_skip'].extend([False, True])
    else:
        for i, grad in enumerate(grads_no_skip):
            gradient_data['depth'].append(depth)
            gradient_data['layer'].append(i)
            gradient_data['gradient_norm'].append(grad)
            gradient_data['has_skip'].append(False)

print("\n💡 Key Observations (compare with the numbers printed above):")
print("• In plain deep networks, gradients reaching early layers tend to shrink as depth grows")
print("• Skip connections add an identity path that lets gradients bypass each layer")
print("• The benefit of skip connections usually grows with depth")
print("• Residual designs made 100+ layer networks trainable (e.g., ResNet-152)")
```
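The "gradient highway" idea can be illustrated with a scalar toy model. This is a sketch under simplifying assumptions, not a measurement of the networks above: each layer is a scalar multiplication by a hypothetical weight `w`, so a plain stack `y = w * x` has derivative `w**n`, while a residual stack `y = x + w * x` has derivative `(1 + w)**n`.

```python
def plain_chain_gradient(w: float, n_layers: int) -> float:
    """d(output)/d(input) for n stacked scalar layers y = w * x."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= w  # chain rule: each layer contributes a factor w
    return grad


def residual_chain_gradient(w: float, n_layers: int) -> float:
    """d(output)/d(input) for n stacked scalar layers y = x + w * x."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= 1.0 + w  # the identity path adds 1 to each layer's derivative
    return grad


# w = 0.1 is an arbitrary illustrative value
for n in (5, 10, 20, 30):
    plain = plain_chain_gradient(0.1, n)
    residual = residual_chain_gradient(0.1, n)
    print(f"layers={n:>2}  plain={plain:.2e}  residual={residual:.2e}")
```

Real layers use weight matrices and nonlinearities, so the per-layer factor is not constant. The toy model only shows why the added identity term keeps the product from collapsing toward zero; it also hints that residual gradients can grow, which is one reason normalization layers are typically paired with skip connections.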

### 2.2 Gradient Flow Visualization

```python
# Create comprehensive gradient flow visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Gradient norms by depth
for depth in depths:
    depth_data_no_skip = [g for d, g, s in zip(gradient_data['depth'], 
                                              gradient_data['gradient_norm'], 
                                              gradient_data['has_skip']) 
                         if d == depth and not s]
    if depth_data_no_skip:
        ax1.semilogy([depth] * len(depth_data_no_skip), depth_data_no_skip, 'bo', alpha=0.6, label='No Skip' if depth == depths[0] else "")

# Add skip connection data for deeper networks
for depth in [d for d in depths if d >= 10]:
    depth_data_skip = [g for d, g, s in zip(gradient_data['depth'], 
                                           gradient_data['gradient_norm'], 
                                           gradient_data['has_skip']) 
                      if d == depth and s]
    if depth_data_skip:
        ax1.semilogy([depth] * len(depth_data_skip), depth_data_skip, 'ro', alpha=0.6, label='With Skip' if depth == 10 else "")

ax1.set_xlabel('Network Depth')
ax1.set_ylabel('Gradient Norm (log scale)')
ax1.set_title('Vanishing Gradients vs Network Depth', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Layer-wise gradient flow for 20-layer network
depth_20_data = [(l, g, s) for d, l, g, s in zip(gradient_data['depth'], 
                                                 gradient_data['layer'],
                                                 gradient_data['gradient_norm'], 
                                                 gradient_data['has_skip']) if d == 20]

if depth_20_data:
    layers_no_skip = [l for l, g, s in depth_20_data if not s]
    grads_no_skip = [g for l, g, s in depth_20_data if not s]
    layers_skip = [l for l, g, s in depth_20_data if s]
    grads_skip = [g for l, g, s in depth_20_data if s]

    if layers_no_skip:
        ax2.semilogy(layers_no_skip, grads_no_skip, 'ro-', label='No Skip', linewidth=2, markersize=6)
    if layers_skip:
        ax2.semilogy(layers_skip, grads_skip, 'go-', label='With Skip', linewidth=2, markersize=6)

    ax2.set_xlabel('Layer Index')
    ax2.set_ylabel('Gradient Norm (log scale)')
    ax2.set_title('Layer-wise Gradient Flow (20 layers)', fontweight='bold')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

# Plot 3: Illustrative gradient decay (the per-layer factors 0.5 and 0.9 are assumed values, not measurements)
layers = np.arange(1, 21)
theoretical_decay = np.power(0.5, layers)  # Geometric decay: one constant factor per layer
ax3.semilogy(layers, theoretical_decay, 'r--', linewidth=3, label='Strong decay (0.5^layer)')
ax3.semilogy(layers, np.power(0.9, layers), 'b--', linewidth=2, label='Mild decay (0.9^layer)')
ax3.set_xlabel('Layer Depth')
ax3.set_ylabel('Relative Gradient Strength')
ax3.set_title('Theoretical Gradient Decay Patterns', fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

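# Plot 4: Skip connection benefit analysis (ratios computed from the measurements in gradient_data)
improvement_data = [d for d in sorted(set(gradient_data['depth'])) if d >= 10]
improvement_ratios = []
for d in improvement_data:
    rows = [(g, s) for dd, g, s in zip(gradient_data['depth'],
                                       gradient_data['gradient_norm'],
                                       gradient_data['has_skip']) if dd == d]
    mean_skip = np.mean([g for g, s in rows if s])
    mean_no_skip = np.mean([g for g, s in rows if not s])
    improvement_ratios.append(mean_skip / mean_no_skip if mean_no_skip > 0 else 1.0)

bars = ax4.bar(improvement_data, improvement_ratios, color='green', alpha=0.7, width=2)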
ax4.set_xlabel('Network Depth')
ax4.set_ylabel('Gradient Improvement Ratio')
ax4.set_title('Skip Connection Benefits', fontweight='bold')
ax4.grid(True, alpha=0.3, axis='y')
ax4.axhline(y=1, color='red', linestyle='--', label='No improvement', linewidth=2)

# Add value labels on bars
for bar, ratio in zip(bars, improvement_ratios):
    height = bar.get_height()
    ax4.text(bar.get_x() + bar.get_width()/2., height + 0.05,
             f'{ratio:.1f}x', ha='center', va='bottom', fontweight='bold')

ax4.legend()
ax4.set_ylim(0, max(improvement_ratios) * 1.2)

plt.suptitle('The Vanishing Gradient Problem and Skip Connections', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'vanishing_gradients_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()
```

---

## 3. ResNet Implementation from Scratch

### 3.1 Basic ResNet Building Blocks

```python
print("=== 3.1 ResNet Architecture Implementation ===\n")

class BasicBlock(nn.Module):
    """Basic ResNet block with 3x3 convolutions and skip connection."""
    
    expansion = 1
    
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1, downsample=None):
        super(BasicBlock, self).__init__()
        
        # First convolution
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        
        # Second convolution
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        self.downsample = downsample
        self.relu = nn.ReLU(inplace=True)
        self.stride = stride
        
    def forward(self, x):
        identity = x
        
        # First conv block
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        # Second conv block
        out = self.conv2(out)
        out = self.bn2(out)
        
        # Downsample if needed
        if self.downsample is not None:
            identity = self.downsample(x)
        
        # Skip connection
        out += identity
        out = self.relu(out)
        
        return out

class Bottleneck(nn.Module):
    """Bottleneck block for deeper ResNet variants (ResNet-50, 101, 152)."""
    
    expansion = 4
    
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1, downsample=None):
        super(Bottleneck, self).__init__()
        
        # 1x1 conv (reduce dimensions)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        
        # 3x3 conv (main processing)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # 1x1 conv (expand dimensions)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        
        self.downsample = downsample
        self.relu = nn.ReLU(inplace=True)
        self.stride = stride
        
    def forward(self, x):
        identity = x
        
        # 1x1 reduce
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        # 3x3 process
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        
        # 1x1 expand
        out = self.conv3(out)
        out = self.bn3(out)
        
        # Downsample if needed
        if self.downsample is not None:
            identity = self.downsample(x)
        
        # Skip connection
        out += identity
        out = self.relu(out)
        
        return out

print("✅ ResNet building blocks implemented!")
print("   • BasicBlock: 3x3 → 3x3 with skip connection")
print("   • Bottleneck: 1x1 → 3x3 → 1x1 with 4x expansion")
```
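The cost of the two block types can be checked by hand. The sketch below counts learnable parameters with plain arithmetic, assuming bias-free convolutions and BatchNorm with scale and shift as in the classes above, and ignoring any downsample branch. The helper names are hypothetical and are not part of the notebook's utilities.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights in a bias-free 2D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel


def bn_params(channels: int) -> int:
    """Learnable scale and shift of a BatchNorm2d layer."""
    return 2 * channels


def basic_block_params(channels: int) -> int:
    """BasicBlock with equal input/output width and no downsample branch."""
    return 2 * conv_params(channels, channels, 3) + 2 * bn_params(channels)


def bottleneck_params(planes: int, expansion: int = 4) -> int:
    """Bottleneck whose input already has planes * expansion channels."""
    wide = planes * expansion
    convs = (conv_params(wide, planes, 1)      # 1x1 reduce
             + conv_params(planes, planes, 3)  # 3x3 process
             + conv_params(planes, wide, 1))   # 1x1 expand
    norms = 2 * bn_params(planes) + bn_params(wide)
    return convs + norms


print(f"BasicBlock (64 -> 64):         {basic_block_params(64):,}")
print(f"Bottleneck (256 -> 64 -> 256): {bottleneck_params(64):,}")
```

Under these assumptions a bottleneck operating on 256-channel feature maps costs roughly as many parameters as a basic block on 64-channel maps, because the 3x3 convolution only ever sees the narrow width.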

### 3.2 Complete ResNet Architecture

```python
class ResNet(nn.Module):
    """Complete ResNet implementation supporting multiple variants."""
    
    def __init__(self, block, layers: List[int], num_classes: int = 1000, zero_init_residual: bool = False):
        super(ResNet, self).__init__()
        
        self.inplanes = 64
        self.dilation = 1
        
        # Stem layers - initial feature extraction
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        # ResNet layers - main feature extraction stages
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        
        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        
        # Initialize weights
        self._initialize_weights(zero_init_residual)
        
    def _make_layer(self, block, planes: int, blocks: int, stride: int = 1) -> nn.Sequential:
        """Create a layer with multiple blocks."""
        
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )
        
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))
        
        return nn.Sequential(*layers)
    
    def _initialize_weights(self, zero_init_residual: bool):
        """Initialize network weights using He initialization."""
        
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
        
        # Zero-initialize the last BN in each residual branch for improved training
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)
    
    def forward(self, x):
        # Stem
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        
        # Main layers
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        # Classification head
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        
        return x

# Factory functions for different ResNet variants
def resnet18(num_classes: int = 1000, **kwargs):
    """ResNet-18 model"""
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes=num_classes, **kwargs)

def resnet34(num_classes: int = 1000, **kwargs):
    """ResNet-34 model"""
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes=num_classes, **kwargs)

def resnet50(num_classes: int = 1000, **kwargs):
    """ResNet-50 model"""
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, **kwargs)

def resnet101(num_classes: int = 1000, **kwargs):
    """ResNet-101 model"""
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes=num_classes, **kwargs)

def resnet152(num_classes: int = 1000, **kwargs):
    """ResNet-152 model"""
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes=num_classes, **kwargs)

print("✅ Complete ResNet architecture implemented!")
print("   • ResNet-18/34: BasicBlock architecture")
print("   • ResNet-50/101/152: Bottleneck architecture")
print("   • Flexible num_classes parameter")
print("   • Proper weight initialization")
```
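The variant names and the shape reaching the classification head both follow from simple bookkeeping. The sketch below assumes the usual counting convention (stem convolution, every convolution inside the residual blocks, and the final linear layer, with downsample projections excluded) and the standard floor-based output-size formula for convolution and pooling. Both helpers are hypothetical names introduced for illustration.

```python
from typing import List


def resnet_depth(blocks_per_stage: List[int], convs_per_block: int) -> int:
    """Weighted layers: stem conv + block convs + final linear layer."""
    return convs_per_block * sum(blocks_per_stage) + 2


def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output side length of a conv/pool layer (floor mode, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# Depth of each variant (2 convs per BasicBlock, 3 per Bottleneck)
print("ResNet-18 depth: ", resnet_depth([2, 2, 2, 2], 2))
print("ResNet-50 depth: ", resnet_depth([3, 4, 6, 3], 3))
print("ResNet-152 depth:", resnet_depth([3, 8, 36, 3], 3))

# Spatial size of a 224x224 input through the stem and the four stages
size = conv_out_size(224, kernel=7, stride=2, padding=3)   # conv1
sizes = [size]
size = conv_out_size(size, kernel=3, stride=2, padding=1)  # maxpool
sizes.append(size)
for stage_stride in (1, 2, 2, 2):                          # layer1..layer4
    size = conv_out_size(size, kernel=3, stride=stage_stride, padding=1)
    sizes.append(size)
print("Spatial sizes:", sizes)
```

Only the first block of each stage applies the stride, so tracing one 3x3 convolution per stage is enough to follow the resolution.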

### 3.3 ResNet Variants Testing and Analysis

```python
# Test ResNet implementations
print("🔬 Testing ResNet Implementations:")
print("=" * 50)

resnet_models = {
    'ResNet-18': resnet18(num_classes=10),
    'ResNet-34': resnet34(num_classes=10),
    'ResNet-50': resnet50(num_classes=10),
    'ResNet-101': resnet101(num_classes=10)
}

# Analyze model characteristics
model_stats = []
test_input = torch.randn(1, 3, 224, 224)

print(f"{'Model':<12} {'Parameters':<12} {'Depth':<8} {'Memory (MB)':<12} {'Forward Time (ms)':<18}")
print("-" * 70)

for name, model in resnet_models.items():
    # Count parameters
    total_params = count_parameters(model)
    
    # Calculate theoretical depth
    if 'ResNet-18' in name:
        depth = 18
    elif 'ResNet-34' in name:
        depth = 34
    elif 'ResNet-50' in name:
        depth = 50
    elif 'ResNet-101' in name:
        depth = 101
    else:
        depth = "Unknown"
    
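    # Test forward pass (one untimed warm-up run, then a single timed run)
    model.eval()
    with torch.no_grad():
        model(test_input)  # Warm-up: keeps one-time allocation overhead out of the timing
        start_time = time.perf_counter()
        output = model(test_input)
        forward_time = (time.perf_counter() - start_time) * 1000  # Convert to ms
    
    # Estimate parameter memory only (excludes activations and buffers)
    memory_mb = total_params * 4 / (1024 * 1024)  # 4 bytes per float32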
    
    print(f"{name:<12} {total_params:<12,} {depth:<8} {memory_mb:<12.1f} {forward_time:<18.2f}")
    
    model_stats.append({
        'name': name,
        'parameters': total_params,
        'depth': depth,
        'memory_mb': memory_mb,
        'forward_time_ms': forward_time,
        'output_shape': list(output.shape)
    })

# Visualize ResNet architecture characteristics
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Parameters vs Depth
names = [s['name'] for s in model_stats]
params = [s['parameters'] / 1e6 for s in model_stats]  # Convert to millions
depths = [s['depth'] for s in model_stats if isinstance(s['depth'], int)]

if len(depths) == len(params):
    ax1.scatter(depths, params, s=150, alpha=0.7, color=['blue', 'red', 'green', 'orange'])
    for i, name in enumerate(names):
        if isinstance(model_stats[i]['depth'], int):
            ax1.annotate(name.replace('ResNet-', ''), (depths[i], params[i]), 
                        xytext=(5, 5), textcoords='offset points', fontweight='bold')

ax1.set_xlabel('Network Depth')
ax1.set_ylabel('Parameters (Millions)')
ax1.set_title('ResNet Variants: Parameters vs Depth', fontweight='bold')
ax1.grid(True, alpha=0.3)

# Plot 2: Memory usage comparison
memory_usage = [s['memory_mb'] for s in model_stats]
bars = ax2.bar(range(len(names)), memory_usage, alpha=0.8, 
               color=['blue', 'red', 'green', 'orange'])

ax2.set_xlabel('ResNet Variant')
ax2.set_ylabel('Memory Usage (MB)')
ax2.set_title('Memory Footprint Comparison', fontweight='bold')
ax2.set_xticks(range(len(names)))
ax2.set_xticklabels([name.replace('ResNet-', 'R-') for name in names])
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, mem in zip(bars, memory_usage):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + max(memory_usage)*0.01,
             f'{mem:.1f}', ha='center', va='bottom', fontweight='bold')

# Plot 3: Forward pass timing
forward_times = [s['forward_time_ms'] for s in model_stats]
bars = ax3.bar(range(len(names)), forward_times, alpha=0.8, 
               color=['blue', 'red', 'green', 'orange'])

ax3.set_xlabel('ResNet Variant')
ax3.set_ylabel('Forward Pass Time (ms)')
ax3.set_title('Inference Speed Comparison', fontweight='bold')
ax3.set_xticks(range(len(names)))
ax3.set_xticklabels([name.replace('ResNet-', 'R-') for name in names])
ax3.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, time_ms in zip(bars, forward_times):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + max(forward_times)*0.01,
             f'{time_ms:.1f}', ha='center', va='bottom', fontweight='bold')

# Plot 4: Architecture complexity radar chart
if len(model_stats) >= 3:
    # Normalize metrics for radar chart
    max_params = max(params)
    max_memory = max(memory_usage)
    max_time = max(forward_times)
    max_depth = max([d for d in depths])
    
    # Create radar chart for first 3 models
    categories = ['Parameters', 'Memory', 'Speed', 'Depth']
    angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False)
    angles = np.concatenate((angles, [angles[0]]))
    
    ax4.remove()  # Drop the Cartesian axes created by plt.subplots so they do not show through
    ax4 = fig.add_subplot(2, 2, 4, projection='polar')
    
    colors = ['blue', 'red', 'green']
    for i in range(min(3, len(model_stats))):
        values = [
            params[i] / max_params,
            memory_usage[i] / max_memory,
            1 - (forward_times[i] / max_time),  # Invert for speed (higher is better)
            depths[i] / max_depth if isinstance(model_stats[i]['depth'], int) else 0
        ]
        values = np.concatenate((values, [values[0]]))
        
        ax4.plot(angles, values, 'o-', linewidth=2, label=names[i], color=colors[i])
        ax4.fill(angles, values, alpha=0.25, color=colors[i])
    
    ax4.set_xticks(angles[:-1])
    ax4.set_xticklabels(categories)
    ax4.set_ylim(0, 1)
    ax4.set_title('Architecture Complexity Profile', fontweight='bold', pad=20)
    ax4.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0))

plt.suptitle('ResNet Architecture Analysis Dashboard', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'resnet_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()

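stats_by_name = {s['name']: s for s in model_stats}
param_ratio = stats_by_name['ResNet-50']['parameters'] / stats_by_name['ResNet-18']['parameters']

print(f"\n💡 ResNet Architecture Insights:")
print(f"• Parameter growth: ResNet-50 has about {param_ratio:.1f}x the parameters of ResNet-18")
print(f"• Bottleneck design: 1x1 convolutions keep the 3x3 convolution narrow, so depth grows without a proportional rise in parameters")
print(f"• Memory scaling: The estimate counts parameters only, so it is proportional to parameter count by construction")
print(f"• Speed trade-offs: Deeper models require more computation time")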

print(f"\n✅ ResNet architectures successfully implemented and analyzed!")
```
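A single timed forward pass is noisy. One common way to get steadier numbers is to discard a few warm-up runs and report the median of several repeats. The helper below is a minimal sketch of that idea; `benchmark` is a hypothetical name, and the workload is a stand-in so the snippet runs without a model.

```python
import statistics
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialization)
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; in the notebook this could be a model forward pass
median_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_ms:.3f} ms")
```

In the notebook, the callable could wrap `model(test_input)` inside `torch.no_grad()`. On a GPU, timings are only meaningful if `torch.cuda.synchronize()` is called before reading the clock, because CUDA kernels launch asynchronously.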

---

## 4. DenseNet: Dense Connections and Feature Reuse

### 4.1 DenseNet Building Blocks

```python
print("=== 4.1 DenseNet Architecture Implementation ===\n")

class DenseLayer(nn.Module):
    """Dense layer with pre-activation design and growth rate."""
    
    def __init__(self, num_input_features: int, growth_rate: int, bn_size: int = 4, drop_rate: float = 0.0):
        super(DenseLayer, self).__init__()
        
        self.drop_rate = drop_rate
        
        # Bottleneck layers (1x1 → 3x3)
        self.norm1 = nn.BatchNorm2d(num_input_features)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(num_input_features, bn_size * growth_rate, kernel_size=1, stride=1, bias=False)
        
        self.norm2 = nn.BatchNorm2d(bn_size * growth_rate)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate, kernel_size=3, stride=1, padding=1, bias=False)
        
        if drop_rate > 0:
            self.dropout = nn.Dropout2d(drop_rate)
    
    def forward(self, x):
        # Handle concatenated features from previous layers
        if isinstance(x, torch.Tensor):
            prev_features = [x]
        else:
            prev_features = x
        
        concatenated = torch.cat(prev_features, 1)
        
        # Bottleneck (1x1 conv)
        bottleneck_output = self.conv1(self.relu1(self.norm1(concatenated)))
        
        # 3x3 conv
        new_features = self.conv2(self.relu2(self.norm2(bottleneck_output)))
        
        if self.drop_rate > 0:
            new_features = self.dropout(new_features)
        
        return new_features

class DenseBlock(nn.Module):
    """Dense block containing multiple dense layers with feature concatenation."""
    
    def __init__(self, num_layers: int, num_input_features: int, growth_rate: int, bn_size: int = 4, drop_rate: float = 0.0):
        super(DenseBlock, self).__init__()
        
        self.layers = nn.ModuleList()
        
        for i in range(num_layers):
            layer = DenseLayer(
                num_input_features + i * growth_rate,
                growth_rate=growth_rate,
                bn_size=bn_size,
                drop_rate=drop_rate
            )
            self.layers.append(layer)
    
    def forward(self, init_features):
        features = [init_features]
        
        for layer in self.layers:
            new_features = layer(features)
            features.append(new_features)
        
        return torch.cat(features, 1)

class Transition(nn.Module):
    """Transition layer between dense blocks for downsampling."""
    
    def __init__(self, num_input_features: int, num_output_features: int):
        super(Transition, self).__init__()
        
        self.norm = nn.BatchNorm2d(num_input_features)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(num_input_features, num_output_features, kernel_size=1, stride=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
    
    def forward(self, x):
        out = self.conv(self.relu(self.norm(x)))
        out = self.pool(out)
        return out

print("✅ DenseNet building blocks implemented!")
print("   • DenseLayer: Feature concatenation with growth rate")
print("   • DenseBlock: Multiple dense layers with dense connections")
print("   • Transition: Downsampling between blocks")
```
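The channel arithmetic inside a dense block is easy to lose track of. The sketch below mirrors the `num_input_features + i * growth_rate` rule used by `DenseBlock` with plain integers; `dense_block_channels` is a hypothetical helper written for illustration only.

```python
from typing import List, Tuple


def dense_block_channels(num_input_features: int, growth_rate: int,
                         num_layers: int, bn_size: int = 4
                         ) -> Tuple[List[Tuple[int, int]], int]:
    """Return per-layer (input width, bottleneck width) and the block output width."""
    per_layer = []
    for i in range(num_layers):
        # Layer i sees the block input plus the outputs of all i earlier layers
        in_ch = num_input_features + i * growth_rate
        per_layer.append((in_ch, bn_size * growth_rate))
    out_ch = num_input_features + num_layers * growth_rate
    return per_layer, out_ch


layers, out_channels = dense_block_channels(64, growth_rate=32, num_layers=6)
for idx, (in_ch, mid_ch) in enumerate(layers, start=1):
    print(f"layer {idx}: {in_ch:>3} -> {mid_ch} (1x1) -> 32 (3x3)")
print("block output channels:", out_channels)
```

Each layer adds only `growth_rate` new channels, while the 1x1 bottleneck caps the width seen by the 3x3 convolution at `bn_size * growth_rate`, regardless of how many features have accumulated.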

### 4.2 Complete DenseNet Architecture

```python
class DenseNet(nn.Module):
    """Complete DenseNet implementation with configurable architecture."""
    
    def __init__(self, growth_rate: int = 32, block_config: Tuple[int, int, int, int] = (6, 12, 24, 16),
                 num_init_features: int = 64, bn_size: int = 4, drop_rate: float = 0, num_classes: int = 1000):
        super(DenseNet, self).__init__()
        
        # First convolution (stem)
        self.features = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2, padding=3, bias=False)),
            ('norm0', nn.BatchNorm2d(num_init_features)),
            ('relu0', nn.ReLU(inplace=True)),
            ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
        ]))
        
        # Dense blocks and transitions
        num_features = num_init_features
        
        for i, num_layers in enumerate(block_config):
            # Add dense block
            block = DenseBlock(
                num_layers=num_layers,
                num_input_features=num_features,
                growth_rate=growth_rate,
                bn_size=bn_size,
                drop_rate=drop_rate
            )
            self.features.add_module(f'denseblock{i+1}', block)
            num_features = num_features + num_layers * growth_rate
            
            # Add transition layer (except after the last dense block)
            if i != len(block_config) - 1:
                trans = Transition(num_input_features=num_features, num_output_features=num_features // 2)
                self.features.add_module(f'transition{i+1}', trans)
                num_features = num_features // 2
        
        # Final batch norm
        self.features.add_module('norm5', nn.BatchNorm2d(num_features))
        
        # Linear classification layer
        self.classifier = nn.Linear(num_features, num_classes)
        
        # Initialize weights
        self._initialize_weights()
    
    def _initialize_weights(self):
        """Initialize network weights properly."""
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)
    
    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = torch.flatten(out, 1)
        out = self.classifier(out)
        return out

# Factory functions for different DenseNet variants
def densenet121(num_classes: int = 1000, **kwargs):
    """DenseNet-121 model"""
    return DenseNet(growth_rate=32, block_config=(6, 12, 24, 16), num_init_features=64, num_classes=num_classes, **kwargs)

def densenet169(num_classes: int = 1000, **kwargs):
    """DenseNet-169 model"""
    return DenseNet(growth_rate=32, block_config=(6, 12, 32, 32), num_init_features=64, num_classes=num_classes, **kwargs)

def densenet201(num_classes: int = 1000, **kwargs):
    """DenseNet-201 model"""
    return DenseNet(growth_rate=32, block_config=(6, 12, 48, 32), num_init_features=64, num_classes=num_classes, **kwargs)

def densenet264(num_classes: int = 1000, **kwargs):
    """DenseNet-264 model"""
    return DenseNet(growth_rate=32, block_config=(6, 12, 64, 48), num_init_features=64, num_classes=num_classes, **kwargs)

print("✅ Complete DenseNet architecture implemented!")
print("   • Configurable growth rate and block structure")
print("   • Multiple variants: DenseNet-121/169/201/264")
print("   • Feature reuse through dense connections")
print("   • Parameter-efficient design (narrow layers, small growth rate)")
```
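The width of the final classifier follows from the block configuration. As a sketch, the function below replays the `num_features` bookkeeping from `DenseNet.__init__` with integers only, assuming the same halving in each transition; `densenet_channel_trace` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def densenet_channel_trace(block_config: Sequence[int], growth_rate: int = 32,
                           num_init_features: int = 64) -> List[Tuple[str, int]]:
    """Channel count after each dense block and transition."""
    trace = []
    num_features = num_init_features
    for i, num_layers in enumerate(block_config):
        num_features += num_layers * growth_rate  # dense block concatenates new features
        trace.append((f"denseblock{i + 1}", num_features))
        if i != len(block_config) - 1:
            num_features //= 2                    # transition halves the channels
            trace.append((f"transition{i + 1}", num_features))
    return trace


for name, channels in densenet_channel_trace((6, 12, 24, 16)):
    print(f"{name:<12} {channels}")
```

The last entry of the trace is the input width of the linear classifier, which is why the variants end with different feature sizes even though they share the same growth rate.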

### 4.3 DenseNet Analysis and Comparison

```python
# Test DenseNet implementations
print("🌐 Testing DenseNet Implementations:")
print("=" * 50)

dense_models = {
    'DenseNet-121': densenet121(num_classes=10),
    'DenseNet-169': densenet169(num_classes=10),
    'DenseNet-201': densenet201(num_classes=10)
}

# Compare DenseNet variants
print(f"{'Model':<15} {'Parameters':<12} {'Growth Rate':<12} {'Memory Efficient':<15}")
print("-" * 55)

dense_stats = []
for name, model in dense_models.items():
    total_params = count_parameters(model)
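    growth_rate = 32  # All factory functions above use growth_rate=32
    # Heuristic label based on parameter count only; activation memory is a separate concern
    memory_efficient = "Yes" if total_params < 10_000_000 else "Moderate"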
    
    print(f"{name:<15} {total_params:<12,} {growth_rate:<12} {memory_efficient:<15}")
    
    dense_stats.append({
        'name': name,
        'parameters': total_params,
        'growth_rate': growth_rate,
        'memory_efficient': memory_efficient
    })

# Visualize DenseNet characteristics
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Parameter comparison with ResNet
resnet_params = [model_stats[i]['parameters'] for i in range(min(3, len(model_stats)))]
resnet_names = [model_stats[i]['name'] for i in range(min(3, len(model_stats)))]
dense_params = [s['parameters'] for s in dense_stats]
dense_names = [s['name'] for s in dense_stats]

x_pos_resnet = np.arange(len(resnet_names))
x_pos_dense = np.arange(len(dense_names)) + len(resnet_names) + 0.5

ax1.bar(x_pos_resnet, [p/1e6 for p in resnet_params], alpha=0.8, label='ResNet', color='blue')
ax1.bar(x_pos_dense, [p/1e6 for p in dense_params], alpha=0.8, label='DenseNet', color='green')

all_names = resnet_names + dense_names
all_positions = list(x_pos_resnet) + list(x_pos_dense)
ax1.set_xticks(all_positions)
ax1.set_xticklabels([name.replace('Net-', '-') for name in all_names], rotation=45)
ax1.set_ylabel('Parameters (Millions)')
ax1.set_title('Parameter Comparison: ResNet vs DenseNet', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Plot 2: Feature map growth visualization (theoretical)
layers = np.arange(1, 25)
growth_rate = 32
initial_features = 64

# DenseNet feature growth (simplified: ignores the channel reduction at transition layers)
dense_features = initial_features + growth_rate * layers
# ResNet feature growth (roughly constant within blocks)
resnet_features = np.piecewise(layers, 
                              [layers < 6, (layers >= 6) & (layers < 12), 
                               (layers >= 12) & (layers < 18), layers >= 18],
                              [64, 128, 256, 512])

ax2.plot(layers, dense_features, 'g-', linewidth=3, label='DenseNet (Linear Growth)', marker='o')
ax2.plot(layers, resnet_features, 'b--', linewidth=3, label='ResNet (Step Growth)', marker='s')
ax2.set_xlabel('Layer Depth')
ax2.set_ylabel('Feature Map Channels')
ax2.set_title('Feature Map Growth Patterns', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Memory usage pattern
# Simulate memory usage during training (feature maps + gradients)
batch_size = 32
input_size = 224

dense_memory = []
resnet_memory = []

for layer in layers:
    # DenseNet: accumulates features
    dense_features_at_layer = initial_features + growth_rate * layer
    dense_mem = batch_size * dense_features_at_layer * (input_size // (2**min(layer//6, 3)))**2 * 4 / (1024**2)  # MB
    dense_memory.append(dense_mem)
    
    # ResNet: fixed features per stage
    resnet_features_at_layer = resnet_features[layer-1] if layer <= len(resnet_features) else 512
    resnet_mem = batch_size * resnet_features_at_layer * (input_size // (2**min(layer//6, 3)))**2 * 4 / (1024**2)  # MB
    resnet_memory.append(resnet_mem)

ax3.plot(layers, dense_memory, 'g-', linewidth=3, label='DenseNet Memory', marker='o')
ax3.plot(layers, resnet_memory, 'b--', linewidth=3, label='ResNet Memory', marker='s')
ax3.set_xlabel('Layer Depth')
ax3.set_ylabel('Memory Usage (MB)')
ax3.set_title('Memory Usage Patterns During Training', fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

# Plot 4: Feature reuse visualization (conceptual)
# Create a connectivity matrix showing feature reuse
num_layers = 12
connectivity = np.zeros((num_layers, num_layers))

# DenseNet: each layer connects to all previous layers
for i in range(num_layers):
    for j in range(i+1):
        connectivity[i, j] = 1

im = ax4.imshow(connectivity, cmap='Greens', alpha=0.8)
ax4.set_xlabel('Previous Layers')
ax4.set_ylabel('Current Layer')
ax4.set_title('DenseNet Feature Connectivity Pattern', fontweight='bold')
ax4.set_xticks(range(0, num_layers, 2))
ax4.set_yticks(range(0, num_layers, 2))

# Add colorbar
cbar = plt.colorbar(im, ax=ax4, shrink=0.8)
cbar.set_label('Connected (1) / Not Connected (0)')

plt.suptitle('DenseNet Architecture Analysis', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'densenet_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"\n💡 DenseNet Key Features:")
print(f"• Dense connections: Within a dense block, each layer receives the feature maps of all preceding layers")
print(f"• Feature reuse: Concatenated features are reused by later layers, which also shortens gradient paths")
print(f"• Parameter efficiency: Often fewer parameters than a ResNet of comparable accuracy")
print(f"• Growth rate: Controls the number of new features per layer")
print(f"• Memory considerations: Higher memory usage due to feature concatenation")

print(f"\n✅ DenseNet architectures successfully implemented and analyzed!")
```
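
The memory plot above rests on one formula: a float32 activation tensor of shape `(B, C, H, W)` occupies `B * C * H * W * 4` bytes. The sketch below isolates that arithmetic. `activation_mb` is a hypothetical helper, and the estimate covers a single forward activation only; real training memory also includes gradients, optimizer state, and every concatenated feature map kept alive for the backward pass.

```python
def activation_mb(batch_size, channels, side, bytes_per_value=4):
    """Size in MB of one activation tensor of shape (B, C, side, side)."""
    return batch_size * channels * side * side * bytes_per_value / (1024 ** 2)


# Stem-sized activation from the plot settings: batch 32, 64 channels, 224x224 input
print(activation_mb(32, 64, 224))  # 392.0

# Dense blocks widen as layers are added, so per-tensor memory grows linearly with depth
for layer in (1, 6, 12):
    channels = 64 + 32 * layer
    print(f"layer {layer:2d}: {channels:4d} channels -> {activation_mb(32, channels, 56):.1f} MB at 56x56")
```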

---

## 5. Highway Networks and Advanced Gating Mechanisms

### 5.1 Highway Networks Implementation

```python
print("=== 5.1 Highway Networks Implementation ===\n")

class HighwayLayer(nn.Module):
    """Highway layer with gated skip connections for adaptive information flow."""
    
    def __init__(self, size: int, activation=F.relu, gate_bias: float = -2.0):
        super(HighwayLayer, self).__init__()
        
        self.activation = activation
        self.size = size
        
        # Transform gate - decides how much of the transformed input to let through
        self.transform_gate = nn.Linear(size, size)
        # Normal transformation layer
        self.normal_layer = nn.Linear(size, size)
        
        # Initialize the transform gate with a negative bias so the carry path
        # (identity) dominates at the start of training
        self.transform_gate.bias.data.fill_(gate_bias)
    
    def forward(self, x):
        # Compute transform gate (sigmoid gives values between 0 and 1)
        transform_gate = torch.sigmoid(self.transform_gate(x))
        # Carry gate is complement of transform gate
        carry_gate = 1 - transform_gate
        
        # Normal layer output with activation
        normal_output = self.activation(self.normal_layer(x))
        
        # Highway combination: T * H(x) + (1-T) * x
        output = transform_gate * normal_output + carry_gate * x
        
        return output

class HighwayNetwork(nn.Module):
    """Deep Highway Network with multiple highway layers."""
    
    def __init__(self, input_size: int, num_layers: int = 10, activation=F.relu):
        super(HighwayNetwork, self).__init__()
        
        self.num_layers = num_layers
        self.input_size = input_size
        
        # Highway layers
        self.highway_layers = nn.ModuleList([
            HighwayLayer(input_size, activation) for _ in range(num_layers)
        ])
        
        # Classification head for demonstration only: forward() returns the
        # highway features and does not apply this layer
        self.classifier = nn.Linear(input_size, 10)
    
    def forward(self, x):
        # Pass through all highway layers
        for highway_layer in self.highway_layers:
            x = highway_layer(x)
        return x
    
    def forward_with_gates(self, x):
        """Forward pass that also returns gate activations for analysis."""
        gate_activations = []
        
        for highway_layer in self.highway_layers:
            # Get gate activation
            transform_gate = torch.sigmoid(highway_layer.transform_gate(x))
            gate_activations.append(transform_gate.mean().item())  # Average gate activation
            
            # Apply highway layer
            x = highway_layer(x)
        
        return x, gate_activations

print("✅ Highway Networks implemented!")
print("   • Learnable gating mechanism")
print("   • Adaptive information flow")
print("   • Enables training of very deep networks")
```
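
The default `gate_bias=-2.0` determines how strongly a freshly initialized layer favours the carry path. The sketch below evaluates the sigmoid at a few bias values, assuming the gate's pre-activation is close to its bias at initialization (small weights, roughly zero-mean inputs). `sigmoid` is a local helper written for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, the same nonlinearity applied to the transform gate."""
    return 1.0 / (1.0 + math.exp(-z))


for bias in (0.0, -1.0, -2.0, -4.0):
    transform = sigmoid(bias)
    carry = 1.0 - transform
    print(f"gate_bias={bias:+.1f}  transform={transform:.3f}  carry={carry:.3f}")
```

At a bias of -2.0 the transform gate sits near 0.12, so roughly 88% of each input passes through unchanged, which is why the gate analysis on an untrained network reports "High carry" for most layers.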

### 5.2 Attention Mechanisms

```python
print("=== 5.2 Attention Mechanisms Implementation ===\n")

class ChannelAttention(nn.Module):
    """Channel attention (CBAM-style): Squeeze-and-Excitation extended with a max-pooled branch."""
    
    def __init__(self, channels: int, reduction: int = 16):
        super(ChannelAttention, self).__init__()
        
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        
        # Shared MLP for both average and max pooled features
        # (hidden width is clamped to at least 1 so small channel counts still work)
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels, bias=False)
        )
        
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        batch_size, channels, height, width = x.size()
        
        # Average pooling branch
        avg_pool = self.avg_pool(x).view(batch_size, channels)
        avg_pool = self.mlp(avg_pool)
        
        # Max pooling branch
        max_pool = self.max_pool(x).view(batch_size, channels)
        max_pool = self.mlp(max_pool)
        
        # Combine and apply sigmoid activation
        channel_att = self.sigmoid(avg_pool + max_pool).view(batch_size, channels, 1, 1)
        
        return x * channel_att

class SpatialAttention(nn.Module):
    """Spatial attention mechanism focusing on 'where' is important."""
    
    def __init__(self, kernel_size: int = 7):
        super(SpatialAttention, self).__init__()
        
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel_size, padding=kernel_size//2, bias=False)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        # Channel-wise statistics
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)
        
        # Concatenate along channel dimension
        concat = torch.cat([avg_out, max_out], dim=1)
        
        # Apply convolution and sigmoid
        spatial_att = self.sigmoid(self.conv(concat))
        
        return x * spatial_att

class CBAM(nn.Module):
    """Convolutional Block Attention Module combining channel and spatial attention."""
    
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super(CBAM, self).__init__()
        
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention(kernel_size)
    
    def forward(self, x):
        # Apply channel attention first, then spatial attention
        x = self.channel_att(x)
        x = self.spatial_att(x)
        return x

print("✅ Attention mechanisms implemented!")
print("   • Channel Attention: Focus on 'what' is important")
print("   • Spatial Attention: Focus on 'where' is important")
print("   • CBAM: Combined channel and spatial attention")
```
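
These attention modules are cheap in parameters, which is easy to confirm with arithmetic. The sketch below counts weights for the layer shapes used above (two bias-free linear layers for channel attention, one bias-free 2-to-1 convolution for spatial attention). The helper names are hypothetical and exist only for this illustration.

```python
def channel_attention_params(channels: int, reduction: int = 16) -> int:
    """Weights in the shared MLP: C -> C//r -> C, both layers without bias."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels


def spatial_attention_params(kernel_size: int = 7) -> int:
    """Weights in the 2-in, 1-out convolution without bias."""
    return 2 * 1 * kernel_size * kernel_size


def cbam_params(channels: int, reduction: int = 16, kernel_size: int = 7) -> int:
    """CBAM is channel attention followed by spatial attention."""
    return channel_attention_params(channels, reduction) + spatial_attention_params(kernel_size)


for c in (64, 256, 512):
    print(f"C={c:3d}: channel={channel_attention_params(c):6d}  "
          f"spatial={spatial_attention_params():3d}  cbam={cbam_params(c):6d}")
```

For comparison, a single 3x3 convolution from 64 to 64 channels has 36,864 weights, so attaching CBAM to such a layer adds under 2% more parameters.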

### 5.3 Highway Networks and Attention Testing

```python
# Test Highway Networks and Attention mechanisms
print("🛣️ Testing Highway Networks and Attention Mechanisms:")
print("=" * 60)

# Test Highway Network
highway_net = HighwayNetwork(input_size=256, num_layers=20)
test_input_1d = torch.randn(32, 256)

print("Highway Network Analysis:")
with torch.no_grad():
    highway_output, gate_activations = highway_net.forward_with_gates(test_input_1d)

print(f"  Input shape: {test_input_1d.shape}")
print(f"  Output shape: {highway_output.shape}")
print(f"  Network depth: {highway_net.num_layers} layers")
print(f"  Average gate activation per layer:")

for i, gate_act in enumerate(gate_activations):
    print(f"    Layer {i+1:2d}: {gate_act:.3f} ({'High transform' if gate_act > 0.5 else 'High carry'})")

# Test Attention Mechanisms
test_input_2d = torch.randn(8, 64, 32, 32)

print(f"\nAttention Mechanisms Analysis:")
print(f"  Input shape: {test_input_2d.shape}")

# Channel attention
channel_att = ChannelAttention(64)
channel_output = channel_att(test_input_2d)
print(f"  Channel attention output: {channel_output.shape}")

# Spatial attention
spatial_att = SpatialAttention()
spatial_output = spatial_att(test_input_2d)
print(f"  Spatial attention output: {spatial_output.shape}")

# CBAM
cbam = CBAM(64)
cbam_output = cbam(test_input_2d)
print(f"  CBAM output: {cbam_output.shape}")

# Visualize gate activations and attention effects
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Highway gate activations
layers = range(1, len(gate_activations) + 1)
ax1.plot(layers, gate_activations, 'bo-', linewidth=2, markersize=6)
ax1.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Transform/Carry Balance')
ax1.fill_between(layers, gate_activations, 0.5, 
                 where=[g > 0.5 for g in gate_activations], 
                 alpha=0.3, color='blue', label='Transform Dominant')
ax1.fill_between(layers, gate_activations, 0.5, 
                 where=[g <= 0.5 for g in gate_activations], 
                 alpha=0.3, color='red', label='Carry Dominant')
ax1.set_xlabel('Highway Layer')
ax1.set_ylabel('Transform Gate Activation')
ax1.set_title('Highway Network Gate Analysis', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Attention mechanism comparison
# NOTE: these scores are illustrative placeholders, not measured results
mechanisms = ['Channel\nAttention', 'Spatial\nAttention', 'CBAM']
effectiveness = [0.85, 0.78, 0.92]
colors = ['blue', 'green', 'red']

bars = ax2.bar(mechanisms, effectiveness, color=colors, alpha=0.7)
ax2.set_ylabel('Effectiveness Score (illustrative)')
ax2.set_title('Attention Mechanism Comparison (Illustrative)', fontweight='bold')
ax2.set_ylim(0, 1)
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, eff in zip(bars, effectiveness):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.02,
             f'{eff:.2f}', ha='center', va='bottom', fontweight='bold')

# Plot 3: Feature map activation visualization (simulated)
# Create a simulated feature map and its attention-weighted version
np.random.seed(42)
feature_map = np.random.rand(16, 16)
attention_map = np.exp(-((np.arange(16)[:, None] - 8)**2 + (np.arange(16) - 8)**2) / 20)
attended_features = feature_map * attention_map

im1 = ax3.imshow(feature_map, cmap='viridis', alpha=0.8)
ax3.set_title('Original Feature Map', fontweight='bold')
ax3.axis('off')
plt.colorbar(im1, ax=ax3, shrink=0.8)

im2 = ax4.imshow(attended_features, cmap='viridis', alpha=0.8)
ax4.set_title('Attention-Weighted Features', fontweight='bold')
ax4.axis('off')
plt.colorbar(im2, ax=ax4, shrink=0.8)

plt.suptitle('Advanced Architecture Components Analysis', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'highway_attention_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"\n💡 Advanced Architecture Features:")
print(f"• Highway Networks: Learnable gating for information flow")
print(f"• Channel Attention: Focus on important feature channels")  
print(f"• Spatial Attention: Focus on important spatial locations")
print(f"• CBAM: Channel attention followed by spatial attention")

print(f"\n✅ Highway Networks and Attention mechanisms successfully implemented!")
```

---

## 6. Custom Architecture Design Framework

### 6.1 Modular Building Blocks

```python
print("=== 6.1 Custom Architecture Design Framework ===\n")

class ModularBlock(nn.Module):
    """Highly configurable modular block for custom architecture design."""
    
    def __init__(self, in_channels: int, out_channels: int, 
                 block_type: str = 'basic', activation: str = 'relu',
                 normalization: str = 'batch', attention: str = 'none',
                 dropout_rate: float = 0.0):
        super().__init__()
        
        self.block_type = block_type
        self.in_channels = in_channels
        self.out_channels = out_channels
        
        # Main convolution layers based on block type
        if block_type == 'basic':
            self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
        elif block_type == 'bottleneck':
            self.conv1 = nn.Conv2d(in_channels, out_channels // 4, kernel_size=1, bias=False)
            self.conv2 = nn.Conv2d(out_channels // 4, out_channels // 4, kernel_size=3, padding=1, bias=False)
            self.conv3 = nn.Conv2d(out_channels // 4, out_channels, kernel_size=1, bias=False)
        elif block_type == 'depthwise':
            self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, groups=in_channels, bias=False)
            self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        elif block_type == 'inverted':
            expand_ratio = 6
            expanded_channels = in_channels * expand_ratio
            self.conv1 = nn.Conv2d(in_channels, expanded_channels, kernel_size=1, bias=False)
            self.conv2 = nn.Conv2d(expanded_channels, expanded_channels, kernel_size=3, padding=1, groups=expanded_channels, bias=False)
            self.conv3 = nn.Conv2d(expanded_channels, out_channels, kernel_size=1, bias=False)
        
        # Normalization layers
        self._setup_normalization(normalization, out_channels)
        
        # Activation function
        self.activation = self._get_activation(activation)
        
        # Attention mechanism
        self.attention = self._get_attention(attention, out_channels)
        
        # Dropout
        self.dropout = nn.Dropout2d(dropout_rate) if dropout_rate > 0 else nn.Identity()
        
        # Skip connection setup: identity when shapes match, otherwise a 1x1 projection
        self.use_skip = (in_channels == out_channels)
        if not self.use_skip:
            self.skip_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
            self.skip_norm = self._get_normalization_layer(normalization, out_channels)
    
    def _setup_normalization(self, normalization: str, channels: int):
        """Setup normalization layers based on block type."""
        # Route every case through the shared helper so that all normalization
        # types ('batch', 'layer', 'group', 'instance') work for every block type
        if self.block_type == 'bottleneck':
            self.norm1 = self._get_normalization_layer(normalization, channels // 4)
            self.norm2 = self._get_normalization_layer(normalization, channels // 4)
            self.norm3 = self._get_normalization_layer(normalization, channels)
        elif self.block_type == 'inverted':
            expanded_channels = self.in_channels * 6  # must match expand_ratio in __init__
            self.norm1 = self._get_normalization_layer(normalization, expanded_channels)
            self.norm2 = self._get_normalization_layer(normalization, expanded_channels)
            self.norm3 = self._get_normalization_layer(normalization, channels)
        else:
            self.norm = self._get_normalization_layer(normalization, channels)
    
    def _get_normalization_layer(self, normalization: str, channels: int):
        """Get normalization layer based on type."""
        if normalization == 'batch':
            return nn.BatchNorm2d(channels)
        elif normalization == 'layer':
            return nn.GroupNorm(1, channels)
        elif normalization == 'group':
            # GroupNorm requires num_groups to divide channels evenly
            return nn.GroupNorm(math.gcd(32, channels), channels)
        elif normalization == 'instance':
            return nn.InstanceNorm2d(channels)
        else:
            return nn.Identity()
    
    def _get_activation(self, activation: str):
        """Get activation function based on type."""
        if activation == 'relu':
            return nn.ReLU(inplace=True)
        elif activation == 'gelu':
            return nn.GELU()
        elif activation == 'swish' or activation == 'silu':
            return nn.SiLU()
        elif activation == 'leaky_relu':
            return nn.LeakyReLU(0.1, inplace=True)
        elif activation == 'elu':
            return nn.ELU(inplace=True)
        else:
            return nn.ReLU(inplace=True)
    
    def _get_attention(self, attention: str, channels: int):
        """Get attention mechanism based on type."""
        if attention == 'se' or attention == 'channel':
            return ChannelAttention(channels)
        elif attention == 'spatial':
            return SpatialAttention()
        elif attention == 'cbam':
            return CBAM(channels)
        else:
            return nn.Identity()
    
    def forward(self, x):
        identity = x
        
        if self.block_type == 'basic':
            out = self.conv(x)
            out = self.norm(out)
            out = self.activation(out)
        
        elif self.block_type == 'bottleneck':
            out = self.activation(self.norm1(self.conv1(x)))
            out = self.activation(self.norm2(self.conv2(out)))
            out = self.norm3(self.conv3(out))
        
        elif self.block_type == 'depthwise':
            out = self.depthwise(x)
            out = self.pointwise(out)
            out = self.norm(out)
            out = self.activation(out)
        
        elif self.block_type == 'inverted':
            out = self.activation(self.norm1(self.conv1(x)))  # Expand
            out = self.activation(self.norm2(self.conv2(out)))  # Depthwise
            out = self.norm3(self.conv3(out))  # Project (no activation)
        
        # Apply attention
        out = self.attention(out)
        
        # Apply dropout
        out = self.dropout(out)
        
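        # Skip connection (out-of-place add: an in-place += can overwrite a tensor
        # that autograd saved for the backward pass, e.g. the output of an in-place ReLU)
        if self.use_skip:
            out = out + identity
        elif hasattr(self, 'skip_conv'):
            skip_out = self.skip_conv(identity)
            skip_out = self.skip_norm(skip_out)
            out = out + skip_out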
        
        return out

print("✅ Modular building blocks implemented!")
print("   • Block types: basic, bottleneck, depthwise, inverted bottleneck")
print("   • Normalizations: batch, layer, group, instance")
print("   • Activations: ReLU, GELU, SiLU, LeakyReLU, ELU")
print("   • Attention: SE, spatial, CBAM")
print("   • Flexible skip connections and dropout")
```
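
The four block types trade parameters for capacity very differently. The sketch below counts only convolution weights for the layer shapes used in `ModularBlock`, ignoring normalization, attention, and skip-projection parameters, and assuming the fixed expansion ratio of 6 for the inverted block. The function names are hypothetical helpers for this illustration.

```python
def basic_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Single k x k convolution."""
    return k * k * c_in * c_out


def bottleneck_params(c_in: int, c_out: int, k: int = 3) -> int:
    """1x1 reduce -> k x k -> 1x1 expand, with a middle width of c_out // 4."""
    mid = c_out // 4
    return c_in * mid + k * k * mid * mid + mid * c_out


def depthwise_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Depthwise k x k (one filter per input channel) followed by a 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def inverted_params(c_in: int, c_out: int, k: int = 3, expand_ratio: int = 6) -> int:
    """1x1 expand -> depthwise k x k -> 1x1 project."""
    expanded = c_in * expand_ratio
    return c_in * expanded + k * k * expanded + expanded * c_out


c_in, c_out = 256, 512
for name, fn in [("basic", basic_params), ("bottleneck", bottleneck_params),
                 ("depthwise", depthwise_params), ("inverted", inverted_params)]:
    print(f"{name:<10} {fn(c_in, c_out):>10,} conv weights")
```

At these widths the depthwise block needs roughly a ninth of the weights of the basic block, while the inverted block is about as large as the basic one because of its 6x expansion.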

### 6.2 Custom Architecture Builder

```python
class CustomArchitecture(nn.Module):
    """Flexible custom architecture builder with comprehensive configuration options."""
    
    def __init__(self, architecture_config: Dict, num_classes: int = 1000):
        super().__init__()
        
        self.config = architecture_config
        self.num_classes = num_classes
        
        # Build components
        self.stem = self._build_stem(architecture_config.get('stem', {}))
        self.blocks = self._build_blocks(architecture_config.get('blocks', []))
        self.head = self._build_head(architecture_config.get('head', {}))
        
        # Initialize weights
        self._initialize_weights()
    
    def _build_stem(self, stem_config: Dict):
        """Build the stem (initial feature extraction layers)."""
        layers = []
        in_channels = stem_config.get('in_channels', 3)
        out_channels = stem_config.get('out_channels', 64)
        kernel_size = stem_config.get('kernel_size', 7)
        stride = stem_config.get('stride', 2)
        padding = stem_config.get('padding', kernel_size//2)
        
        # Initial convolution
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=False))
        layers.append(nn.BatchNorm2d(out_channels))
        layers.append(nn.ReLU(inplace=True))
        
        # Optional max pooling
        if stem_config.get('max_pool', True):
            pool_size = stem_config.get('pool_size', 3)
            pool_stride = stem_config.get('pool_stride', 2)
            pool_padding = stem_config.get('pool_padding', 1)
            layers.append(nn.MaxPool2d(pool_size, stride=pool_stride, padding=pool_padding))
        
        return nn.Sequential(*layers)
    
    def _build_blocks(self, blocks_config: List[Dict]):
        """Build the main processing blocks."""
        blocks = nn.ModuleList()
        current_channels = self.config.get('stem', {}).get('out_channels', 64)
        
        for stage_idx, block_config in enumerate(blocks_config):
            num_blocks = block_config.get('num_blocks', 2)
            out_channels = block_config.get('out_channels', current_channels)
            block_type = block_config.get('block_type', 'basic')
            activation = block_config.get('activation', 'relu')
            normalization = block_config.get('normalization', 'batch')
            attention = block_config.get('attention', 'none')
            dropout_rate = block_config.get('dropout_rate', 0.0)
            stride = block_config.get('stride', 1)
            
            # Create stage blocks
            for block_idx in range(num_blocks):
                # First block in stage may have stride for downsampling
                current_stride = stride if block_idx == 0 else 1
                
                block = ModularBlock(
                    current_channels, out_channels, block_type,
                    activation, normalization, attention, dropout_rate
                )
                
                # Handle stride for first block (add pooling if needed)
                if current_stride > 1 and block_idx == 0:
                    # Wrap block with downsampling
                    block = nn.Sequential(
                        block,
                        nn.AvgPool2d(kernel_size=current_stride, stride=current_stride)
                    )
                
                blocks.append(block)
                current_channels = out_channels
        
        return blocks
    
    def _build_head(self, head_config: Dict):
        """Build the classification/output head."""
        layers = []
        
        # Global pooling
        pool_type = head_config.get('pool_type', 'avg')
        if pool_type == 'avg':
            layers.append(nn.AdaptiveAvgPool2d(1))
        elif pool_type == 'max':
            layers.append(nn.AdaptiveMaxPool2d(1))
        elif pool_type == 'gem':
            # Generalized Mean Pooling (simplified)
            layers.append(nn.AdaptiveAvgPool2d(1))  # Fallback to avg
        
        layers.append(nn.Flatten())
        
        # Optional intermediate layers
        intermediate_dims = head_config.get('intermediate_dims', [])
        current_dim = head_config.get('in_features', 512)
        dropout = head_config.get('dropout', 0.0)
        
        for dim in intermediate_dims:
            layers.append(nn.Linear(current_dim, dim))
            layers.append(nn.ReLU(inplace=True))
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            current_dim = dim
        
        # Without intermediate layers, apply the configured dropout before the classifier
        # (otherwise the 'dropout' setting would be silently ignored)
        if not intermediate_dims and dropout > 0:
            layers.append(nn.Dropout(dropout))
        
        # Final classifier
        layers.append(nn.Linear(current_dim, self.num_classes))
        
        return nn.Sequential(*layers)
    
    def _initialize_weights(self):
        """Initialize network weights using best practices."""
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                if m.bias is not None:  # the attention MLPs use bias=False
                    nn.init.constant_(m.bias, 0)
    
    def forward(self, x):
        x = self.stem(x)
        
        for block in self.blocks:
            x = block(x)
        
        x = self.head(x)
        return x
    
    def get_architecture_summary(self):
        """Get a summary of the architecture configuration."""
        total_params = count_parameters(self)
        
        summary = {
            'total_parameters': total_params,
            'stem_channels': self.config.get('stem', {}).get('out_channels', 64),
            'num_stages': len(self.config.get('blocks', [])),
            'num_blocks': sum(block.get('num_blocks', 2) for block in self.config.get('blocks', [])),
            'num_classes': self.num_classes,
            'block_types': list(set(block.get('block_type', 'basic') for block in self.config.get('blocks', []))),
            'attention_types': list(set(block.get('attention', 'none') for block in self.config.get('blocks', []) if block.get('attention', 'none') != 'none'))
        }
        
        return summary

print("✅ Custom Architecture Builder implemented!")
print("   • Flexible stem, blocks, and head configuration")
print("   • Support for multiple architectural patterns")
print("   • Automatic weight initialization")
print("   • Architecture summary generation")
```
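
Because the builder downsamples in the stem and again with `AvgPool2d` at each strided stage, it helps to trace the spatial resolution before training. The sketch below applies the standard convolution output-size formula, `floor((n + 2p - k) / s) + 1`, under the builder's defaults (padding of `kernel_size // 2`, a 3x3 stride-2 max pool with padding 1) and assumes the blocks themselves preserve spatial size. `conv_out` and `trace_spatial_size` are hypothetical helpers for this illustration.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_spatial_size(input_size: int, stem: dict, stage_strides: list) -> list:
    """Return the feature-map side length after the stem and after each stage."""
    kernel = stem.get('kernel_size', 7)
    size = conv_out(input_size, kernel, stem.get('stride', 2),
                    stem.get('padding', kernel // 2))
    if stem.get('max_pool', True):
        size = conv_out(size, 3, 2, 1)  # builder defaults for the max pool
    sizes = [size]
    for stride in stage_strides:
        if stride > 1:
            size = conv_out(size, stride, stride, 0)  # AvgPool2d(stride, stride)
        sizes.append(size)
    return sizes


resnet_stem = {'kernel_size': 7, 'stride': 2, 'max_pool': True}
print(trace_spatial_size(224, resnet_stem, [1, 2, 2, 2]))  # [56, 56, 28, 14, 7]
```

Running the same trace on a 32x32 input shows the resolution reaching 1x1 by the last stage, which suggests using a lighter stem for small images.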

### 6.3 Pre-defined Architecture Configurations

```python
def create_efficient_net_config():
    """Create MobileNet/EfficientNet-style configuration."""
    return {
        'stem': {
            'in_channels': 3,
            'out_channels': 32,
            'kernel_size': 3,
            'stride': 2,
            'max_pool': False
        },
        'blocks': [
            {'num_blocks': 1, 'out_channels': 16, 'block_type': 'inverted', 'stride': 1, 'attention': 'se'},
            {'num_blocks': 2, 'out_channels': 24, 'block_type': 'inverted', 'stride': 2, 'attention': 'se'},
            {'num_blocks': 2, 'out_channels': 40, 'block_type': 'inverted', 'stride': 2, 'attention': 'se'},
            {'num_blocks': 3, 'out_channels': 80, 'block_type': 'inverted', 'stride': 2, 'attention': 'se'},
            {'num_blocks': 3, 'out_channels': 112, 'block_type': 'inverted', 'stride': 1, 'attention': 'se'},
            {'num_blocks': 4, 'out_channels': 192, 'block_type': 'inverted', 'stride': 2, 'attention': 'se'},
            {'num_blocks': 1, 'out_channels': 320, 'block_type': 'inverted', 'stride': 1, 'attention': 'se'},
        ],
        'head': {
            'in_features': 320,
            'pool_type': 'avg',
            'dropout': 0.2
        }
    }

def create_resnet_style_config():
    """Create ResNet-style configuration."""
    return {
        'stem': {
            'in_channels': 3,
            'out_channels': 64,
            'kernel_size': 7,
            'stride': 2,
            'max_pool': True
        },
        'blocks': [
            {'num_blocks': 2, 'out_channels': 64, 'block_type': 'basic', 'stride': 1},
            {'num_blocks': 2, 'out_channels': 128, 'block_type': 'basic', 'stride': 2},
            {'num_blocks': 2, 'out_channels': 256, 'block_type': 'basic', 'stride': 2},
            {'num_blocks': 2, 'out_channels': 512, 'block_type': 'basic', 'stride': 2},
        ],
        'head': {
            'in_features': 512,
            'pool_type': 'avg',
            'dropout': 0.1
        }
    }

def create_hybrid_config():
    """Create a hybrid architecture combining multiple techniques."""
    return {
        'stem': {
            'in_channels': 3,
            'out_channels': 64,
            'kernel_size': 3,
            'stride': 2,
            'max_pool': True
        },
        'blocks': [
            {'num_blocks': 2, 'out_channels': 64, 'block_type': 'basic', 'attention': 'none', 'activation': 'relu'},
            {'num_blocks': 2, 'out_channels': 128, 'block_type': 'bottleneck', 'attention': 'se', 'stride': 2, 'activation': 'swish'},
            {'num_blocks': 3, 'out_channels': 256, 'block_type': 'inverted', 'attention': 'cbam', 'stride': 2, 'activation': 'gelu'},
            {'num_blocks': 2, 'out_channels': 512, 'block_type': 'depthwise', 'attention': 'spatial', 'stride': 2, 'activation': 'swish'},
        ],
        'head': {
            'in_features': 512,
            'pool_type': 'avg',
            'intermediate_dims': [256],
            'dropout': 0.15
        }
    }

def create_vision_transformer_cnn_hybrid():
    """Create a hybrid CNN architecture inspired by Vision Transformers."""
    return {
        'stem': {
            'in_channels': 3,
            'out_channels': 96,
            'kernel_size': 4,
            'stride': 4,  # Patch-like stem
            'max_pool': False
        },
        'blocks': [
            {'num_blocks': 2, 'out_channels': 96, 'block_type': 'bottleneck', 'attention': 'cbam', 'normalization': 'layer'},
            {'num_blocks': 2, 'out_channels': 192, 'block_type': 'bottleneck', 'attention': 'cbam', 'stride': 2, 'normalization': 'layer'},
            {'num_blocks': 6, 'out_channels': 384, 'block_type': 'bottleneck', 'attention': 'cbam', 'stride': 2, 'normalization': 'layer'},
            {'num_blocks': 2, 'out_channels': 768, 'block_type': 'bottleneck', 'attention': 'cbam', 'stride': 2, 'normalization': 'layer'},
        ],
        'head': {
            'in_features': 768,
            'pool_type': 'avg',
            'intermediate_dims': [512, 256],
            'dropout': 0.3
        }
    }

print("✅ Pre-defined architecture configurations created!")
print("   • EfficientNet-style: Mobile-optimized with inverted bottlenecks")
print("   • ResNet-style: Classic residual architecture")
print("   • Hybrid: Combines multiple block types and attention mechanisms")
print("   • ViT-CNN Hybrid: CNN architecture inspired by Vision Transformers")
```
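
Because each configuration is a plain dictionary, simple consistency checks can run before any model is built. The sketch below is a hypothetical helper (`check_config` is not part of the notebook) that computes the total downsampling factor and verifies that the head's `in_features` matches the last stage's `out_channels`. It assumes that a stage without a `stride` key uses stride 1 and that `max_pool: True` halves the resolution; `CustomArchitecture` may interpret these keys differently.

```python
from math import prod


def check_config(config, input_size=224):
    """Return (total_stride, approx_feature_map_size) for a config dict.

    Hypothetical helper. Assumes a missing 'stride' key means stride 1 and
    that the stem max-pool (if enabled) downsamples by a factor of 2.
    """
    stem = config['stem']
    blocks = config['blocks']

    # Multiply every downsampling step from stem to last stage
    total_stride = stem.get('stride', 1)
    if stem.get('max_pool', False):
        total_stride *= 2
    total_stride *= prod(block.get('stride', 1) for block in blocks)

    # The classifier head must accept exactly what the last stage emits
    last_channels = blocks[-1]['out_channels']
    head_in = config['head']['in_features']
    if head_in != last_channels:
        raise ValueError(
            f"head in_features={head_in} does not match "
            f"last stage out_channels={last_channels}"
        )
    return total_stride, input_size // total_stride


# Values copied from the ResNet-style configuration above
resnet_like = {
    'stem': {'in_channels': 3, 'out_channels': 64, 'kernel_size': 7,
             'stride': 2, 'max_pool': True},
    'blocks': [
        {'num_blocks': 2, 'out_channels': 64, 'block_type': 'basic', 'stride': 1},
        {'num_blocks': 2, 'out_channels': 128, 'block_type': 'basic', 'stride': 2},
        {'num_blocks': 2, 'out_channels': 256, 'block_type': 'basic', 'stride': 2},
        {'num_blocks': 2, 'out_channels': 512, 'block_type': 'basic', 'stride': 2},
    ],
    'head': {'in_features': 512, 'pool_type': 'avg', 'dropout': 0.1},
}

total_stride, feature_size = check_config(resnet_like)
print(f"total stride: {total_stride}, final feature map: {feature_size}x{feature_size}")
```

Under these assumptions the ResNet-style configuration downsamples by a factor of 32, so a 224×224 input reaches the pooling layer as a 7×7 feature map.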

### 6.4 Custom Architecture Testing and Analysis

```python
# Test custom architectures
print("🎨 Testing Custom Architecture Designs:")
print("=" * 60)

configs = {
    'EfficientNet-Style': create_efficient_net_config(),
    'ResNet-Style': create_resnet_style_config(),
    'Hybrid-Architecture': create_hybrid_config(),
    'ViT-CNN-Hybrid': create_vision_transformer_cnn_hybrid()
}

custom_models = {}
architecture_summaries = {}

for name, config in configs.items():
    print(f"\n🏗️ Building {name}...")
    try:
        model = CustomArchitecture(config, num_classes=10)
        custom_models[name] = model
        architecture_summaries[name] = model.get_architecture_summary()
        
        # Test forward pass
        test_input = torch.randn(1, 3, 224, 224)
        model.eval()
        with torch.no_grad():
            output = model(test_input)
        
        print(f"   ✅ Success! Output shape: {output.shape}")
        
    except Exception as e:
        print(f"   ❌ Error: {e}")

# Comprehensive analysis table
print(f"\n📊 Custom Architecture Analysis:")
print("=" * 100)
print(f"{'Architecture':<20} {'Parameters':<12} {'Blocks':<8} {'Block Types':<25} {'Attention':<20}")
print("-" * 100)

for name, model in custom_models.items():
    summary = architecture_summaries[name]
    
    # Format block types
    block_types_str = ', '.join(summary['block_types'])[:24]
    attention_str = ', '.join(summary['attention_types'])[:19] if summary['attention_types'] else 'None'
    
    print(f"{name:<20} {summary['total_parameters']:<12,} {summary['num_blocks']:<8} {block_types_str:<25} {attention_str:<20}")

# Create comprehensive visualization
fig = plt.figure(figsize=(20, 15))

# Create a 3x3 grid for various analyses
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# Plot 1: Parameter comparison
ax1 = fig.add_subplot(gs[0, 0])
names = list(custom_models.keys())
params = [architecture_summaries[name]['total_parameters'] / 1e6 for name in names]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

bars = ax1.bar(range(len(names)), params, color=colors, alpha=0.8)
ax1.set_ylabel('Parameters (Millions)')
ax1.set_title('Model Size Comparison', fontweight='bold')
ax1.set_xticks(range(len(names)))
ax1.set_xticklabels([name.replace('-', '\n') for name in names], fontsize=10)

# Add value labels
for bar, param in zip(bars, params):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{param:.1f}M', ha='center', va='bottom', fontweight='bold', fontsize=9)

# Plot 2: Architecture complexity radar
ax2 = fig.add_subplot(gs[0, 1], projection='polar')
categories = ['Parameters', 'Depth', 'Block Diversity', 'Attention']
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False)
angles = np.concatenate((angles, [angles[0]]))

# Normalize metrics
max_params = max(params)
max_blocks = max(architecture_summaries[name]['num_blocks'] for name in names)
max_block_types = max(len(architecture_summaries[name]['block_types']) for name in names)
max_attention = max(len(architecture_summaries[name]['attention_types']) for name in names) or 1

for i, name in enumerate(names[:3]):  # Limit to first 3 for clarity
    summary = architecture_summaries[name]
    values = [
        (summary['total_parameters'] / 1e6) / max_params,
        summary['num_blocks'] / max_blocks,
        len(summary['block_types']) / max_block_types,
        len(summary['attention_types']) / max_attention if summary['attention_types'] else 0
    ]
    values = np.concatenate((values, [values[0]]))
    
    ax2.plot(angles, values, 'o-', linewidth=2, label=name.split('-')[0], color=colors[i])
    ax2.fill(angles, values, alpha=0.25, color=colors[i])

ax2.set_xticks(angles[:-1])
ax2.set_xticklabels(categories)
ax2.set_ylim(0, 1)
ax2.set_title('Architecture Complexity Profile', fontweight='bold', pad=20)
ax2.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0))

# Plot 3: Block type distribution
ax3 = fig.add_subplot(gs[0, 2])
all_block_types = set()
for summary in architecture_summaries.values():
    all_block_types.update(summary['block_types'])

block_type_counts = {bt: 0 for bt in all_block_types}
for summary in architecture_summaries.values():
    for bt in summary['block_types']:
        block_type_counts[bt] += 1

block_types = list(block_type_counts.keys())
counts = list(block_type_counts.values())

ax3.pie(counts, labels=block_types, autopct='%1.1f%%', startangle=90, colors=colors[:len(block_types)])
ax3.set_title('Block Type Distribution', fontweight='bold')

# Plot 4: Attention mechanism usage
ax4 = fig.add_subplot(gs[1, 0])
attention_usage = {}
for name, summary in architecture_summaries.items():
    for att_type in summary['attention_types']:
        if att_type not in attention_usage:
            attention_usage[att_type] = []
        attention_usage[att_type].append(name.split('-')[0])

if attention_usage:
    att_types = list(attention_usage.keys())
    att_counts = [len(attention_usage[att]) for att in att_types]
    
    bars = ax4.bar(att_types, att_counts, color=colors[:len(att_types)], alpha=0.8)
    ax4.set_ylabel('Number of Architectures')
    ax4.set_title('Attention Mechanism Usage', fontweight='bold')
    
    # Add architecture names as labels
    for i, (bar, att_type) in enumerate(zip(bars, att_types)):
        height = bar.get_height()
        models_using = ', '.join(attention_usage[att_type])
        ax4.text(bar.get_x() + bar.get_width()/2., height + 0.05,
                 models_using, ha='center', va='bottom', fontsize=8, rotation=0)
else:
    ax4.text(0.5, 0.5, 'No Attention\nMechanisms Used', ha='center', va='center', 
             transform=ax4.transAxes, fontsize=12, fontweight='bold')
    ax4.set_title('Attention Mechanism Usage', fontweight='bold')

# Plot 5: Memory efficiency vs Performance (simulated)
ax5 = fig.add_subplot(gs[1, 1])
# Simulate performance scores based on architecture characteristics
performance_scores = []
memory_efficiency = []

for name in names:
    summary = architecture_summaries[name]
    
    # Placeholder heuristic, NOT a measured accuracy: the score simply grows with
    # the number of attention and block types, plus unseeded random noise
    perf_score = 70 + len(summary['attention_types']) * 5 + len(summary['block_types']) * 3
    perf_score += np.random.normal(0, 2)  # Add some noise
    performance_scores.append(min(95, max(70, perf_score)))  # Clamp between 70-95
    
    # Memory efficiency (inversely related to parameters)
    mem_eff = 100 - (summary['total_parameters'] / 1e6) * 5
    memory_efficiency.append(max(20, min(90, mem_eff)))  # Clamp between 20-90

scatter = ax5.scatter(memory_efficiency, performance_scores, s=200, c=colors[:len(names)], alpha=0.7)

for i, name in enumerate(names):
    ax5.annotate(name.split('-')[0], (memory_efficiency[i], performance_scores[i]),
                 xytext=(5, 5), textcoords='offset points', fontsize=9, fontweight='bold')

ax5.set_xlabel('Memory Efficiency Score')
ax5.set_ylabel('Performance Score')
ax5.set_title('Efficiency vs Performance Trade-off', fontweight='bold')
ax5.grid(True, alpha=0.3)

# Plot 6: Architecture design space
ax6 = fig.add_subplot(gs[1, 2])
design_categories = ['Efficiency', 'Flexibility', 'Innovation', 'Robustness']

# Hand-assigned illustrative scores for each architecture (not measured)
design_scores = {
    'EfficientNet-Style': [0.9, 0.6, 0.7, 0.8],
    'ResNet-Style': [0.7, 0.8, 0.5, 0.9],
    'Hybrid-Architecture': [0.6, 0.9, 0.9, 0.7],
    'ViT-CNN-Hybrid': [0.5, 0.7, 1.0, 0.6]
}

x = np.arange(len(design_categories))
bar_width = 0.2

for i, (name, scores) in enumerate(design_scores.items()):
    if name in custom_models:  # Only plot if model was successfully created
        ax6.bar(x + i * bar_width, scores, bar_width, 
                label=name.split('-')[0], color=colors[i], alpha=0.8)

ax6.set_xlabel('Design Aspects')
ax6.set_ylabel('Score')
ax6.set_title('Architecture Design Profile', fontweight='bold')
ax6.set_xticks(x + bar_width * 1.5)
ax6.set_xticklabels(design_categories)
ax6.legend()
ax6.grid(True, alpha=0.3, axis='y')

# Plot 7: Feature extraction stages
ax7 = fig.add_subplot(gs[2, 0])
stage_info = {}
for name, config in configs.items():
    if name in custom_models:
        stages = []
        for i, block in enumerate(config['blocks']):
            stages.append(f"Stage{i+1}\n({block['out_channels']}ch)")
        stage_info[name] = stages

if stage_info:
    # Show architecture flow for one example
    example_name = list(stage_info.keys())[0]
    stages = stage_info[example_name]
    
    # Create flow diagram
    y_pos = 0.5
    stage_width = 0.8 / len(stages)
    
    for i, stage in enumerate(stages):
        x_pos = 0.1 + i * stage_width
        
        # Draw box
        box = plt.Rectangle((x_pos, y_pos - 0.1), stage_width * 0.8, 0.2,
                           facecolor=colors[0], alpha=0.7, edgecolor='black')
        ax7.add_patch(box)
        
        # Add text
        ax7.text(x_pos + stage_width * 0.4, y_pos, stage,
                ha='center', va='center', fontsize=8, fontweight='bold')
        
        # Add arrow (except for last stage)
        if i < len(stages) - 1:
            ax7.arrow(x_pos + stage_width * 0.8, y_pos, stage_width * 0.15, 0,
                     head_width=0.02, head_length=0.02, fc='black', ec='black')

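ax7.set_xlim(0, 1)
ax7.set_ylim(0, 1)
# example_name only exists if at least one model was built, so guard the title
flow_title = f'Feature Extraction Flow\n({example_name})' if stage_info else 'Feature Extraction Flow'
ax7.set_title(flow_title, fontweight='bold')
ax7.axis('off')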

# Plot 8: Training considerations
ax8 = fig.add_subplot(gs[2, 1])
considerations = ['Memory\nUsage', 'Training\nTime', 'Convergence\nSpeed', 'Generalization']

# Hand-assigned illustrative training characteristics (not measured)
training_data = {
    'EfficientNet-Style': [0.8, 0.7, 0.8, 0.9],  # Efficient, fast
    'ResNet-Style': [0.6, 0.8, 0.9, 0.8],       # Reliable baseline
    'Hybrid-Architecture': [0.4, 0.5, 0.6, 0.9], # Complex but powerful
    'ViT-CNN-Hybrid': [0.3, 0.4, 0.5, 0.8]      # Most complex
}

x = np.arange(len(considerations))
bar_width = 0.15

for i, (name, scores) in enumerate(training_data.items()):
    if name in custom_models:
        ax8.bar(x + i * bar_width, scores, bar_width,
                label=name.split('-')[0], color=colors[i], alpha=0.8)

ax8.set_xlabel('Training Aspects')
ax8.set_ylabel('Score (Higher is Better)')
ax8.set_title('Training Characteristics', fontweight='bold')
ax8.set_xticks(x + bar_width * 1.5)
ax8.set_xticklabels(considerations, fontsize=9)
ax8.legend()
ax8.grid(True, alpha=0.3, axis='y')

# Plot 9: Use case recommendations
ax9 = fig.add_subplot(gs[2, 2])
use_cases = ['Mobile\nDeployment', 'Research\nExperiments', 'Production\nSystems', 'Transfer\nLearning']

# Hand-assigned illustrative suitability scores for each use case (not measured)
suitability_scores = {
    'EfficientNet-Style': [0.95, 0.6, 0.8, 0.7],
    'ResNet-Style': [0.6, 0.7, 0.95, 0.9],
    'Hybrid-Architecture': [0.3, 0.95, 0.6, 0.8],
    'ViT-CNN-Hybrid': [0.2, 0.9, 0.5, 0.6]
}

# Create heatmap
suitability_matrix = []
arch_labels = []
for name, scores in suitability_scores.items():
    if name in custom_models:
        suitability_matrix.append(scores)
        arch_labels.append(name.split('-')[0])

if suitability_matrix:
    im = ax9.imshow(suitability_matrix, cmap='RdYlGn', aspect='auto', vmin=0, vmax=1)
    ax9.set_xticks(range(len(use_cases)))
    ax9.set_xticklabels(use_cases, fontsize=9)
    ax9.set_yticks(range(len(arch_labels)))
    ax9.set_yticklabels(arch_labels, fontsize=9)
    ax9.set_title('Architecture Use Case Suitability', fontweight='bold')
    
    # Add text annotations
    for i in range(len(arch_labels)):
        for j in range(len(use_cases)):
            text = ax9.text(j, i, f'{suitability_matrix[i][j]:.2f}',
                           ha="center", va="center", color="black", fontweight='bold')

    # Add colorbar
    cbar = plt.colorbar(im, ax=ax9, shrink=0.8)
    cbar.set_label('Suitability Score')

plt.suptitle('Custom Architecture Design Framework Analysis', fontsize=18, fontweight='bold')
plt.savefig(os.path.join(results_dir, 'custom_architecture_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✅ Custom architecture design framework successfully implemented and analyzed!")
```
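
The radar plot in the block above scales each metric by its maximum across architectures so that all axes share a 0 to 1 range. A minimal sketch of that normalization, isolated as a hypothetical helper (`normalize_by_max` is an illustrative name, and the sample numbers are made up), shows the zero-maximum guard that the plotting code applies with `or 1`:

```python
def normalize_by_max(values):
    """Scale non-negative values into [0, 1] by dividing by their maximum."""
    peak = max(values) or 1  # fall back to 1 when every value is 0
    return [v / peak for v in values]


# Made-up counts of distinct attention types for three architectures
attention_counts = [1, 0, 4]
print(normalize_by_max(attention_counts))

# All-zero input stays at zero instead of raising ZeroDivisionError
print(normalize_by_max([0, 0, 0]))
```

Dividing by the maximum keeps the relative ordering of architectures on each axis, but it also means the plotted values change whenever a model is added to or removed from the comparison.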

---

## 7. Comprehensive Architecture Comparison and Performance Analysis

### 7.1 Cross-Architecture Benchmarking

```python
print("=== 7.1 Comprehensive Architecture Comparison ===\n")

# Collect all implemented models
all_models = {}

# Add ResNet models
for name, model in resnet_models.items():
    all_models[name] = {
        'model': model,
        'type': 'ResNet',
        'parameters': count_parameters(model),
        'year': 2015  # All ResNet depths were introduced in the same 2015 paper
    }

# Add DenseNet models  
for name, model in dense_models.items():
    all_models[f"{name}"] = {
        'model': model,
        'type': 'DenseNet',
        'parameters': count_parameters(model),
        'year': 2017
    }

# Add Custom models
for name, model in custom_models.items():
    all_models[f"Custom-{name.split('-')[0]}"] = {
        'model': model,
        'type': 'Custom',
        'parameters': count_parameters(model),
        'year': 2024  # Current implementations
    }

def benchmark_model(model, model_name, num_runs=5):
    """Comprehensive model benchmarking."""
    model.eval()
    
    benchmark_results = {
        'name': model_name,
        'parameters': count_parameters(model),
        'forward_times': [],
        'memory_usage_mb': 0,
        'flops_estimate': 0
    }
    
    # Test different input sizes (batch sizes 1, 8, and 32 at 3x224x224)
    input_sizes = [(1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224)]
    
    with torch.no_grad():
        for input_size in input_sizes:
            batch_size = input_size[0]
            test_input = torch.randn(*input_size)
            
            # Warmup
            for _ in range(3):
                _ = model(test_input)
            
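            # Benchmark forward pass with a monotonic, high-resolution timer.
            # On a CUDA device, call torch.cuda.synchronize() before reading the
            # timer, because kernels are launched asynchronously.
            times = []
            for _ in range(num_runs):
                start_time = time.perf_counter()
                output = model(test_input)
                end_time = time.perf_counter()
                times.append((end_time - start_time) * 1000)  # Convert to ms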
            
            benchmark_results['forward_times'].append({
                'batch_size': batch_size,
                'mean_time_ms': np.mean(times),
                'std_time_ms': np.std(times),
                'throughput_samples_per_sec': batch_size / (np.mean(times) / 1000)
            })
    
    # Estimate memory usage (rough calculation)
    benchmark_results['memory_usage_mb'] = benchmark_results['parameters'] * 4 / (1024 * 1024)  # 4 bytes per float32
    
    # Estimate FLOPs (placeholder heuristic, not a measurement)
    # Parameter count alone does not determine FLOPs: convolution cost also scales with
    # feature-map resolution, so real FLOP counting requires a per-layer analysis
    if benchmark_results['parameters'] < 5e6:
        benchmark_results['flops_estimate'] = benchmark_results['parameters'] * 2  # Low complexity
    elif benchmark_results['parameters'] < 25e6:
        benchmark_results['flops_estimate'] = benchmark_results['parameters'] * 4  # Medium complexity
    else:
        benchmark_results['flops_estimate'] = benchmark_results['parameters'] * 8  # High complexity
    
    return benchmark_results

# Benchmark all models
print("🔬 Benchmarking All Architectures:")
print("=" * 80)

benchmark_results = []
total_models = len(all_models)

for i, (name, model_info) in enumerate(all_models.items()):
    print(f"({i+1:2d}/{total_models}) Benchmarking {name}... ", end='', flush=True)
    
    try:
        result = benchmark_model(model_info['model'], name)
        result['type'] = model_info['type']
        result['year'] = model_info['year']
        benchmark_results.append(result)
        print("✅")
    except Exception as e:
        print(f"❌ Error: {str(e)[:50]}")

# Create comprehensive comparison table
print(f"\n📊 Architecture Performance Summary:")
print("=" * 120)
print(f"{'Architecture':<25} {'Type':<10} {'Parameters':<12} {'Memory(MB)':<12} {'Time@B=1':<12} {'Time@B=32':<12} {'Throughput':<12}")
print("-" * 120)

for result in sorted(benchmark_results, key=lambda x: x['parameters']):
    time_b1 = result['forward_times'][0]['mean_time_ms'] if result['forward_times'] else 0
    time_b32 = result['forward_times'][2]['mean_time_ms'] if len(result['forward_times']) > 2 else 0
    throughput = result['forward_times'][2]['throughput_samples_per_sec'] if len(result['forward_times']) > 2 else 0
    
    print(f"{result['name']:<25} {result['type']:<10} {result['parameters']:<12,} "
          f"{result['memory_usage_mb']:<12.1f} {time_b1:<12.2f} {time_b32:<12.2f} {throughput:<12.1f}")

print(f"\nNotes: Time in ms, Throughput in samples/sec, Memory for model weights only")
```
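
The FLOP figure in `benchmark_model` is derived from the parameter count alone, which ignores spatial resolution. For a convolution, the multiply-accumulate (MAC) count can be computed analytically from the layer shape. The sketch below uses hypothetical helper names (`conv_output_size`, `conv2d_macs`, `conv2d_params`) and assumes square inputs and kernels, no dilation, and no bias term. It applies them to a 7×7, stride-2 stem like the one in the ResNet-style configuration, with a padding of 3 assumed.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (square input, no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv2d_params(in_channels, out_channels, kernel_size, groups=1):
    """Weight count of a bias-free 2D convolution."""
    return out_channels * (in_channels // groups) * kernel_size ** 2


def conv2d_macs(in_channels, out_channels, kernel_size, input_size,
                stride=1, padding=0, groups=1):
    """Multiply-accumulate operations for one forward pass on one image."""
    out_size = conv_output_size(input_size, kernel_size, stride, padding)
    # Each output element needs one MAC per weight in its receptive field
    macs_per_output = (in_channels // groups) * kernel_size ** 2
    return out_channels * out_size * out_size * macs_per_output


# 7x7 stem, 3 -> 64 channels, stride 2, padding 3 (padding is an assumption)
stem_params = conv2d_params(3, 64, kernel_size=7)
stem_macs = conv2d_macs(3, 64, kernel_size=7, input_size=224, stride=2, padding=3)

print(f"stem params: {stem_params:,}")
print(f"stem MACs:   {stem_macs:,}")
print(f"MACs per weight: {stem_macs // stem_params:,}")
```

Every weight in this layer is reused once per output position (112 × 112 = 12,544 times), which is why a fixed multiplier on the parameter count cannot stand in for a FLOP count.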

### 7.2 Architecture Evolution Timeline and Analysis

```python
# Create architecture evolution timeline
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(18, 14))

# Plot 1: Architecture Evolution Timeline
years = [result['year'] for result in benchmark_results]
params = [result['parameters'] / 1e6 for result in benchmark_results]
names = [result['name'] for result in benchmark_results]
types = [result['type'] for result in benchmark_results]

# Color mapping for architecture types
type_colors = {'ResNet': '#FF6B6B', 'DenseNet': '#4ECDC4', 'Custom': '#45B7D1'}
colors_timeline = [type_colors.get(t, '#95A5A6') for t in types]

scatter = ax1.scatter(years, params, s=150, c=colors_timeline, alpha=0.7, edgecolors='black')

# Add labels for significant models
for i, name in enumerate(names):
    if any(x in name for x in ['ResNet-50', 'DenseNet-121', 'Custom-Efficient', 'Custom-Hybrid']):
        ax1.annotate(name.replace('Custom-', ''), (years[i], params[i]),
                    xytext=(5, 5), textcoords='offset points', fontsize=9, fontweight='bold')

ax1.set_xlabel('Year Introduced')
ax1.set_ylabel('Parameters (Millions)')
ax1.set_title('Architecture Evolution Timeline', fontweight='bold')
ax1.grid(True, alpha=0.3)

# Create legend
for arch_type, color in type_colors.items():
    ax1.scatter([], [], c=color, s=100, label=arch_type, alpha=0.7, edgecolors='black')
ax1.legend()

# Plot 2: Parameter Efficiency Analysis
if benchmark_results:
    batch32_times = []
    param_counts = []
    efficiency_names = []
    
    for result in benchmark_results:
        if len(result['forward_times']) > 2:
            batch32_times.append(result['forward_times'][2]['mean_time_ms'])
            param_counts.append(result['parameters'] / 1e6)
            efficiency_names.append(result['name'].replace('Custom-', ''))
    
    if batch32_times and param_counts:
        # Calculate efficiency (lower time per parameter is better)
        efficiency_scores = [p / t for p, t in zip(param_counts, batch32_times)]
        
        scatter2 = ax2.scatter(param_counts, batch32_times, s=150, 
                              c=efficiency_scores, cmap='RdYlGn', alpha=0.7, edgecolors='black')
        
        # Add model labels
        for i, name in enumerate(efficiency_names):
            if len(name) < 15:  # Only label shorter names for clarity
                ax2.annotate(name, (param_counts[i], batch32_times[i]),
                            xytext=(5, 5), textcoords='offset points', fontsize=8)
        
        ax2.set_xlabel('Parameters (Millions)')
        ax2.set_ylabel('Inference Time (ms, batch=32)')
        ax2.set_title('Parameter Efficiency Analysis', fontweight='bold')
        ax2.grid(True, alpha=0.3)
        
        # Add colorbar
        cbar = plt.colorbar(scatter2, ax=ax2)
        cbar.set_label('Efficiency Score\n(Parameters/Time)')

# Plot 3: Architecture Type Comparison
type_stats = {}
for result in benchmark_results:
    arch_type = result['type']
    if arch_type not in type_stats:
        type_stats[arch_type] = {
            'params': [],
            'times': [],
            'memory': []
        }
    
    type_stats[arch_type]['params'].append(result['parameters'] / 1e6)
    if result['forward_times']:
        type_stats[arch_type]['times'].append(result['forward_times'][0]['mean_time_ms'])
    type_stats[arch_type]['memory'].append(result['memory_usage_mb'])

# Box plot for parameter distribution by type
if type_stats:
    arch_types = list(type_stats.keys())
    param_data = [type_stats[t]['params'] for t in arch_types]
    
    bp = ax3.boxplot(param_data, labels=arch_types, patch_artist=True)
    
    # Color the boxes
    for patch, arch_type in zip(bp['boxes'], arch_types):
        patch.set_facecolor(type_colors.get(arch_type, '#95A5A6'))
        patch.set_alpha(0.7)
    
    ax3.set_ylabel('Parameters (Millions)')
    ax3.set_title('Parameter Distribution by Architecture Type', fontweight='bold')
    ax3.grid(True, alpha=0.3, axis='y')

# Plot 4: Performance vs Innovation Matrix
if benchmark_results:
    # Create innovation score based on year and features
    innovation_scores = []
    performance_scores = []
    
    for result in benchmark_results:
        # Innovation score (more recent = higher, custom = bonus)
        innovation = (result['year'] - 2014) * 10  # Base on year
        if result['type'] == 'Custom':
            innovation += 20  # Bonus for custom architectures
        if 'Hybrid' in result['name']:
            innovation += 10  # Bonus for hybrid designs
        
        innovation_scores.append(min(100, innovation))
        
        # Performance score (inverse of normalized time, with parameter penalty)
        if result['forward_times']:
            time_score = max(0, 100 - result['forward_times'][0]['mean_time_ms'])
            param_penalty = min(30, result['parameters'] / 1e6)  # Penalty for large models
            performance = max(0, time_score - param_penalty)
        else:
            performance = 50  # Default score
        
        performance_scores.append(performance)
    
    # Create scatter plot
    scatter3 = ax4.scatter(innovation_scores, performance_scores, s=150, 
                          c=[type_colors.get(r['type'], '#95A5A6') for r in benchmark_results], 
                          alpha=0.7, edgecolors='black')
    
    # Add quadrant dividers at the mean of each axis
    ax4.axhline(y=np.mean(performance_scores), color='gray', linestyle='--', alpha=0.5)
    ax4.axvline(x=np.mean(innovation_scores), color='gray', linestyle='--', alpha=0.5)
    
    # Quadrant labels
    ax4.text(0.95, 0.95, 'High Innovation\nHigh Performance', transform=ax4.transAxes, 
             ha='right', va='top', bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))
    ax4.text(0.05, 0.95, 'Low Innovation\nHigh Performance', transform=ax4.transAxes, 
             ha='left', va='top', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.7))
    ax4.text(0.95, 0.05, 'High Innovation\nLow Performance', transform=ax4.transAxes, 
             ha='right', va='bottom', bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))
    ax4.text(0.05, 0.05, 'Low Innovation\nLow Performance', transform=ax4.transAxes, 
             ha='left', va='bottom', bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.7))
    
    # Add model labels for interesting points
    for i, result in enumerate(benchmark_results):
        if (innovation_scores[i] > np.mean(innovation_scores) + np.std(innovation_scores) or
            performance_scores[i] > np.mean(performance_scores) + np.std(performance_scores)):
            ax4.annotate(result['name'].replace('Custom-', ''), 
                        (innovation_scores[i], performance_scores[i]),
                        xytext=(5, 5), textcoords='offset points', fontsize=8)
    
    ax4.set_xlabel('Innovation Score')
    ax4.set_ylabel('Performance Score')
    ax4.set_title('Innovation vs Performance Matrix', fontweight='bold')
    ax4.grid(True, alpha=0.3)

plt.suptitle('Comprehensive Architecture Analysis Dashboard', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'architecture_comparison_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()
```
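
The innovation-versus-performance panel splits the plane at the mean of each axis. The same rule can be written as a small function that names the quadrant of each model, which is handy for printing alongside the figure. `assign_quadrants` and the sample scores are illustrative assumptions, not values produced by the notebook; a point lying exactly on a dividing line is counted as "High" here.

```python
from statistics import mean


def assign_quadrants(innovation, performance):
    """Label each (innovation, performance) pair by its quadrant.

    The dividing lines are the means of the two score lists, matching the
    dashed lines drawn in the plot above.
    """
    inn_mid = mean(innovation)
    perf_mid = mean(performance)

    labels = []
    for inn, perf in zip(innovation, performance):
        inn_label = 'High' if inn >= inn_mid else 'Low'
        perf_label = 'High' if perf >= perf_mid else 'Low'
        labels.append(f"{inn_label} Innovation / {perf_label} Performance")
    return labels


# Made-up scores for four models
innovation = [10, 30, 100, 100]
performance = [90, 70, 40, 80]

for label in assign_quadrants(innovation, performance):
    print(label)
```

Because the thresholds are means over the benchmarked set, the quadrant a model lands in is relative to the other models in the comparison rather than an absolute rating.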

### 7.3 Architecture Selection Guide and Recommendations

```python
print("=== 7.3 Architecture Selection Guide ===\n")

def generate_architecture_recommendations():
    """Generate personalized architecture recommendations based on use cases."""
    
    recommendations = {
        'Mobile/Edge Deployment': {
            'primary': [],
            'secondary': [],
            'avoid': [],
            'criteria': 'Low parameters, fast inference, good accuracy'
        },
        'Research/Experimentation': {
            'primary': [],
            'secondary': [], 
            'avoid': [],
            'criteria': 'Flexibility, novel features, easy modification'
        },
        'Production Systems': {
            'primary': [],
            'secondary': [],
            'avoid': [],
            'criteria': 'Reliability, proven performance, stable training'
        },
        'Transfer Learning': {
            'primary': [],
            'secondary': [],
            'avoid': [],
            'criteria': 'Good feature representations, pre-trained availability'
        }
    }
    
    # Analyze each architecture for different use cases
    for result in benchmark_results:
        name = result['name']
        params = result['parameters'] / 1e6
        arch_type = result['type']
        
        # Mobile/Edge criteria
        if params < 10 and result['forward_times'] and result['forward_times'][0]['mean_time_ms'] < 50:
            recommendations['Mobile/Edge Deployment']['primary'].append(name)
        elif params < 25:
            recommendations['Mobile/Edge Deployment']['secondary'].append(name)
        else:
            recommendations['Mobile/Edge Deployment']['avoid'].append(name)
        
        # Research criteria
        if arch_type == 'Custom' or 'Hybrid' in name:
            recommendations['Research/Experimentation']['primary'].append(name)
        elif arch_type in ['DenseNet']:
            recommendations['Research/Experimentation']['secondary'].append(name)
        
        # Production criteria
        if arch_type == 'ResNet':
            recommendations['Production Systems']['primary'].append(name)
        elif arch_type == 'DenseNet':
            recommendations['Production Systems']['secondary'].append(name)
        elif arch_type == 'Custom':
            recommendations['Production Systems']['avoid'].append(name)
        
        # Transfer learning criteria
        if arch_type in ['ResNet', 'DenseNet']:
            recommendations['Transfer Learning']['primary'].append(name)
        elif params > 20:  # Larger models often have better representations
            recommendations['Transfer Learning']['secondary'].append(name)
    
    return recommendations

# Generate recommendations
recs = generate_architecture_recommendations()

print("🎯 Architecture Selection Guide:")
print("=" * 80)

for use_case, rec_data in recs.items():
    print(f"\n📋 {use_case}:")
    print(f"   Criteria: {rec_data['criteria']}")
    
    if rec_data['primary']:
        print(f"   ✅ Recommended: {', '.join(rec_data['primary'][:3])}")  # Limit to top 3
    
    if rec_data['secondary']:
        print(f"   ⚠️ Alternative: {', '.join(rec_data['secondary'][:3])}")
    
    if rec_data['avoid']:
        print(f"   ❌ Avoid: {', '.join(rec_data['avoid'][:2])}")  # Limit to top 2

# Create architecture decision tree
print(f"\n🌳 Architecture Decision Tree:")
print("=" * 60)

decision_tree = """
PRIMARY USE CASE?
├── 📱 Mobile/Edge Deployment
│   ├── Parameters < 5M → Custom-Efficient, Small ResNets
│   └── Parameters < 25M → DenseNet-121, ResNet-18/34
│
├── 🔬 Research/Experimentation
│   ├── Novel Techniques → Custom-Hybrid, Custom-ViT
│   └── Established Methods → DenseNet variants, ResNet variants
│
├── 🏭 Production System
│   ├── High Reliability → ResNet-50, ResNet-101
│   └── Memory Conscious → ResNet-18, DenseNet-121
│
└── 🎯 Transfer Learning
    ├── Computer Vision → ResNet-50, DenseNet-169
    └── Specialized Domains → Custom architectures with pre-training
"""

print(decision_tree)

# Performance summary statistics
print(f"\n📊 Performance Summary Statistics:")
print("=" * 60)

if benchmark_results:
    # Calculate statistics
    all_params = [r['parameters'] / 1e6 for r in benchmark_results]
    all_times = [r['forward_times'][0]['mean_time_ms'] for r in benchmark_results if r['forward_times']]
    all_memory = [r['memory_usage_mb'] for r in benchmark_results]
    
    print(f"Parameter Range: {min(all_params):.1f}M - {max(all_params):.1f}M (mean: {np.mean(all_params):.1f}M)")
    if all_times:
        print(f"Inference Time: {min(all_times):.1f}ms - {max(all_times):.1f}ms (mean: {np.mean(all_times):.1f}ms)")
    print(f"Memory Usage: {min(all_memory):.1f}MB - {max(all_memory):.1f}MB (mean: {np.mean(all_memory):.1f}MB)")
    
    # Best performers in each category
    print(f"\n🏆 Best Performers:")
    
    # Fastest inference (batch size 1)
    if all_times:
        timed_results = [r for r in benchmark_results if r['forward_times']]
        fastest = min(timed_results, key=lambda r: r['forward_times'][0]['mean_time_ms'])
        print(f"   ⚡ Fastest Inference: {fastest['name']} ({fastest['forward_times'][0]['mean_time_ms']:.1f}ms)")
    
    # Most efficient (heuristic from parameter count and batch-1 latency; no accuracy is measured)
    efficiency_scores = []
    for r in benchmark_results:
        # Simulate efficiency as inverse relationship with parameters and time
        if r['forward_times']:
            eff_score = 1000 / (r['parameters'] / 1e6 + r['forward_times'][0]['mean_time_ms'])
            efficiency_scores.append((r['name'], eff_score))
    
    if efficiency_scores:
        most_efficient = max(efficiency_scores, key=lambda x: x[1])
        print(f"   🎯 Most Efficient: {most_efficient[0]} (score: {most_efficient[1]:.1f})")
    
    # Smallest model
    smallest_idx = np.argmin(all_params)
    print(f"   📱 Smallest Model: {benchmark_results[smallest_idx]['name']} ({min(all_params):.1f}M params)")
    
    # Most innovative (custom architectures with latest features)
    # Use a separate name so the custom_models dict from Section 6.4 is not overwritten
    custom_results = [r for r in benchmark_results if r['type'] == 'Custom']
    if custom_results:
        # Crude placeholder: name length stands in for feature count
        most_innovative = max(custom_results, key=lambda x: len(x['name']))
        print(f"   🚀 Most Innovative: {most_innovative['name']} (Custom architecture)")

print(f"\n💡 Key Insights and Recommendations:")
insights = [
    "• Start with ResNet-18/34 for most applications - reliable baseline",
    "• Use DenseNet when parameter count is the constraint; its concatenated feature maps can still need substantial activation memory",
    "• Custom architectures excel in specialized domains or research",
    "• Consider parameter efficiency for deployment constraints",
    "• Hybrid approaches offer flexibility but may complicate deployment",
    "• Always validate on your specific dataset and requirements"
]

for insight in insights:
    print(f"  {insight}")

print(f"\n✅ Comprehensive architecture analysis complete!")
```
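
The efficiency score in the cell above adds parameters (in millions) to latency (in milliseconds), so whichever quantity has the larger magnitude dominates the ranking. A minimal sketch of a unit-free alternative follows, assuming the same `name`, `parameters`, and `forward_times` fields used by `benchmark_results`. The function name, the equal weighting, and the demo numbers are illustrative assumptions, not measurements from this notebook.

```python
from typing import Dict, List


def normalized_efficiency(results: List[Dict]) -> Dict[str, float]:
    """Score models by min-max normalised size and latency (1.0 = best, 0.0 = worst)."""
    # Models without timing data cannot be scored
    timed = [r for r in results if r.get("forward_times")]
    if not timed:
        return {}

    params = [r["parameters"] for r in timed]
    times = [r["forward_times"][0]["mean_time_ms"] for r in timed]

    def scale(value: float, values: List[float]) -> float:
        """Map value into [0, 1] relative to the observed range."""
        lo, hi = min(values), max(values)
        return 0.0 if hi == lo else (value - lo) / (hi - lo)

    # Equal weighting of size and latency is an arbitrary choice for this sketch
    return {
        r["name"]: 1.0 - 0.5 * (scale(p, params) + scale(t, times))
        for r, p, t in zip(timed, params, times)
    }


# Hypothetical numbers, for illustration only
demo_results = [
    {"name": "model_a", "parameters": 11_000_000, "forward_times": [{"mean_time_ms": 5.0}]},
    {"name": "model_b", "parameters": 25_000_000, "forward_times": [{"mean_time_ms": 12.0}]},
    {"name": "model_c", "parameters": 8_000_000, "forward_times": []},
]
demo_scores = normalized_efficiency(demo_results)
print(demo_scores)
```

Because the score is relative to the models being compared, it changes whenever a model is added to or removed from the benchmark set.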

---

## 8. Summary and Next Steps

### 8.1 Project Summary and Key Learnings

```python
print("=== 8.1 Advanced Architectures - Project Summary ===\n")

def create_comprehensive_summary():
    """Create a comprehensive summary of all architectural implementations and analyses."""
    
    summary = {
        'analysis_timestamp': time.strftime("%Y-%m-%d %H:%M:%S"),
        'architectures_implemented': {},
        'total_models_created': len(all_models),
        'key_innovations_covered': [],
        'performance_insights': {},
        'educational_outcomes': []
    }
    
    # Categorize implemented architectures
    arch_categories = {
        'ResNet': [],
        'DenseNet': [],
        'Custom': [],
        'Components': []
    }
    
    for name, info in all_models.items():
        # setdefault avoids a KeyError for model types not listed above
        arch_categories.setdefault(info['type'], []).append(name)
    
    summary['architectures_implemented'] = arch_categories
    
    # Key innovations
    summary['key_innovations_covered'] = [
        "Skip connections and residual learning",
        "Dense connections and feature reuse", 
        "Highway networks with learnable gates",
        "Attention mechanisms (Channel, Spatial, CBAM)",
        "Modular architecture design framework",
        "Custom block types (Basic, Bottleneck, Depthwise, Inverted)",
        "Multiple normalization techniques",
        "Various activation functions",
        "Flexible attention integration"
    ]
    
    # Performance insights from benchmarking
    if benchmark_results:
        param_range = [min(r['parameters'] for r in benchmark_results), 
                      max(r['parameters'] for r in benchmark_results)]
        
        # Rank by speed only among models that actually have timing data
        timed_results = [r for r in benchmark_results if r['forward_times']]
        fastest_name = (
            min(timed_results, key=lambda r: r['forward_times'][0]['mean_time_ms'])['name']
            if timed_results else "N/A"
        )
        
        summary['performance_insights'] = {
            'parameter_range': f"{param_range[0]/1e6:.1f}M - {param_range[1]/1e6:.1f}M",
            'architecture_types_tested': len(set(r['type'] for r in benchmark_results)),
            'total_benchmarks_run': len(benchmark_results),
            'fastest_architecture': fastest_name
        }
    
    # Educational outcomes
    summary['educational_outcomes'] = [
        "Understanding of vanishing gradient problem and solutions",
        "Implementation of state-of-the-art architectures from scratch",
        "Knowledge of architectural design principles and trade-offs", 
        "Experience with modular and flexible architecture design",
        "Practical benchmarking and performance analysis skills",
        "Architecture selection guidelines for different use cases"
    ]
    
    return summary

# Generate comprehensive summary
final_summary = create_comprehensive_summary()

print("🏗️ ADVANCED NEURAL ARCHITECTURES - COMPREHENSIVE SUMMARY")
print("=" * 80)

print(f"\n⏰ Analysis completed: {final_summary['analysis_timestamp']}")
print(f"🧠 Total models created: {final_summary['total_models_created']}")

print(f"\n📚 Architectures Implemented:")
for arch_type, models in final_summary['architectures_implemented'].items():
    print(f"   {arch_type}: {len(models)} variants")
    for model in models[:3]:  # Show first 3
        print(f"      • {model}")
    if len(models) > 3:
        print(f"      • ... and {len(models) - 3} more")

print(f"\n🔬 Key Innovations Covered:")
for i, innovation in enumerate(final_summary['key_innovations_covered'], 1):
    print(f"   {i:2d}. {innovation}")

if final_summary['performance_insights']:
    print(f"\n📊 Performance Analysis Results:")
    insights = final_summary['performance_insights']
    print(f"   • Parameter range: {insights['parameter_range']}")
    print(f"   • Architecture types tested: {insights['architecture_types_tested']}")
    print(f"   • Total benchmarks: {insights['total_benchmarks_run']}")
    print(f"   • Fastest architecture: {insights['fastest_architecture']}")

print(f"\n🎓 Educational Outcomes Achieved:")
for i, outcome in enumerate(final_summary['educational_outcomes'], 1):
    print(f"   {i}. {outcome}")

print(f"\n💾 Generated Artifacts:")
artifacts = [
    "vanishing_gradients_analysis.png - Gradient flow analysis",
    "resnet_analysis.png - ResNet architecture comparison", 
    "densenet_analysis.png - DenseNet feature analysis",
    "highway_attention_analysis.png - Advanced components analysis",
    "custom_architecture_analysis.png - Custom framework analysis",
    "architecture_comparison_analysis.png - Comprehensive comparison"
]

for artifact in artifacts:
    print(f"   📄 {artifact}")

print(f"\n🎯 Architecture Selection Guidelines:")
selection_guide = [
    "Mobile/Edge: Prioritize efficiency - Custom-Efficient, ResNet-18/34",
    "Research: Flexibility matters - Custom-Hybrid, DenseNet variants", 
    "Production: Reliability first - ResNet-50/101, proven architectures",
    "Transfer Learning: Rich features - ResNet-50, DenseNet-169",
    "Always validate on your specific dataset and requirements",
    "Consider the full pipeline: training time, deployment, maintenance"
]

for i, guide in enumerate(selection_guide, 1):
    print(f"   {i}. {guide}")

print(f"\n🚀 Next Steps and Advanced Topics:")
next_steps = [
    "Implement Transformer architectures (Vision Transformer, Swin Transformer)",
    "Explore Neural Architecture Search (NAS) techniques",
    "Study efficient architectures (MobileNet, EfficientNet variations)",
    "Investigate architecture pruning and quantization",
    "Implement multi-scale and pyramid networks",
    "Explore specialized architectures (object detection, segmentation)",
    "Study emerging architectures (ConvNeXt, RegNet, etc.)"
]

for i, step in enumerate(next_steps, 1):
    print(f"   {i}. {step}")

print(f"\n💡 Key Takeaways:")
takeaways = [
    "Skip connections made very deep networks trainable by mitigating vanishing gradients",
    "Architecture choice significantly impacts model performance and efficiency",
    "Modern architectures combine multiple innovations (attention, normalization, etc.)",
    "Custom architectures can be built systematically using modular components",
    "Performance analysis is crucial for making informed architecture decisions",
    "No single architecture is optimal for all tasks - context matters"
]

for takeaway in takeaways:
    print(f"   • {takeaway}")

# Save comprehensive summary
import json

summary_path = os.path.join(results_dir, 'comprehensive_summary.json')
with open(summary_path, 'w') as f:
    json.dump(final_summary, f, indent=2, default=str)

print(f"\n💾 Complete summary saved to: {summary_path}")

# List all generated files
print(f"\n📁 Generated Files and Results:")
result_files = []
try:
    for file_path in os.listdir(results_dir):
        full_path = os.path.join(results_dir, file_path)
        if os.path.isfile(full_path):
            size_kb = os.path.getsize(full_path) / 1024
            result_files.append((file_path, size_kb))
    
    result_files.sort(key=lambda x: x[1], reverse=True)  # Sort by size
    
    total_size = sum(size for _, size in result_files)
    print(f"📊 Total results: {len(result_files)} files, {total_size:.1f} KB")
    
    for filename, size_kb in result_files:
        if filename.endswith('.png'):
            print(f"   🖼️  {filename} ({size_kb:.1f} KB)")
        elif filename.endswith('.json'):
            print(f"   📄 {filename} ({size_kb:.1f} KB)")
        else:
            print(f"   📋 {filename} ({size_kb:.1f} KB)")

except Exception as e:
    print(f"   Could not list result files: {e}")

print(f"\n🎉 Advanced Neural Network Architectures module completed successfully!")
print(f"🚀 Ready to tackle even more advanced topics in deep learning!")

# Final performance comparison table
print(f"\n📈 Final Architecture Performance Summary:")
print("=" * 100)
print(f"{'Architecture':<25} {'Type':<12} {'Parameters':<12} {'Efficiency':<12} {'Best Use Case':<25}")
print("-" * 100)

# Create final recommendations based on analysis
architecture_recommendations = {
    'ResNet-18': {'type': 'ResNet', 'use_case': 'General purpose, fast', 'efficiency': 'High'},
    'ResNet-50': {'type': 'ResNet', 'use_case': 'Production systems', 'efficiency': 'Medium'},
    'DenseNet-121': {'type': 'DenseNet', 'use_case': 'Memory efficiency', 'efficiency': 'High'}, 
    'Custom-Efficient': {'type': 'Custom', 'use_case': 'Mobile deployment', 'efficiency': 'Very High'},
    'Custom-Hybrid': {'type': 'Custom', 'use_case': 'Research & experiments', 'efficiency': 'Medium'}
}

for result in benchmark_results:
    name = result['name']
    arch_type = result['type']
    params = result['parameters']
    
    # Prefer a name match; fall back to the first recommendation of the same type
    best_match = 'General purpose'
    efficiency = 'Medium'
    
    rec = next((r for n, r in architecture_recommendations.items() if n in name), None)
    if rec is None and 'Custom' not in name:
        rec = next((r for r in architecture_recommendations.values() if r['type'] == arch_type), None)
    if rec is not None:
        best_match = rec['use_case']
        efficiency = rec['efficiency']
    
    print(f"{name:<25} {arch_type:<12} {params:<12,} {efficiency:<12} {best_match:<25}")

print(f"\n" + "=" * 100)
print(f"🎯 Choose architectures based on your specific requirements:")
print(f"   • Speed → ResNet-18, Custom-Efficient")
print(f"   • Accuracy → ResNet-50, DenseNet-169") 
print(f"   • Memory → DenseNet-121, Custom-Efficient")
print(f"   • Innovation → Custom-Hybrid, Custom-ViT")
print(f"=" * 100)
```
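
The first takeaway printed above, that skip connections help gradients survive depth, can be checked with a toy calculation. This is a scalar sketch rather than the notebook's gradient analysis: every layer is assumed to share the same local derivative, and the function names are illustrative. For a residual layer `y = x + f(x)` the local derivative is `1 + f'(x)`, so the product over layers keeps a direct identity path that does not shrink with depth.

```python
def plain_chain_gradient(local_derivative: float, depth: int) -> float:
    """d(output)/d(input) for `depth` stacked scalar layers y = f(x)."""
    return local_derivative ** depth


def residual_chain_gradient(branch_derivative: float, depth: int) -> float:
    """Same quantity when every layer is y = x + f(x), so each factor is 1 + f'(x)."""
    return (1.0 + branch_derivative) ** depth


depth = 50
branch_derivative = 0.01  # assumed small derivative of f, identical in both stacks

plain_grad = plain_chain_gradient(branch_derivative, depth)
residual_grad = residual_chain_gradient(branch_derivative, depth)

print(f"plain stack:    {plain_grad:.3e}")
print(f"residual stack: {residual_grad:.3f}")
```

The same factor `1 + f'(x)` can also grow the gradient when `f'(x)` is large, which is one reason normalization and careful initialization still matter in residual networks.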

### 8.2 Code Repository and Best Practices

```python
print("=== 8.2 Code Organization and Best Practices ===\n")

def create_code_organization_guide():
    """Generate a guide for organizing architecture implementations."""
    
    guide = {
        'project_structure': {
            'src/': {
                'architectures/': {
                    'resnet.py': 'ResNet implementations (BasicBlock, Bottleneck, ResNet)',
                    'densenet.py': 'DenseNet implementations (DenseLayer, DenseBlock, DenseNet)',
                    'highway.py': 'Highway Networks and gated connections',
                    'attention.py': 'Attention mechanisms (SE, CBAM, Spatial)',
                    'custom.py': 'Custom architecture framework',
                    '__init__.py': 'Package initialization and model factory functions'
                },
                'blocks/': {
                    'basic_blocks.py': 'Fundamental building blocks',
                    'attention_blocks.py': 'Attention mechanism implementations', 
                    'normalization.py': 'Various normalization techniques',
                    'activations.py': 'Activation function variants'
                },
                'utils/': {
                    'model_utils.py': 'Model utilities (parameter counting, etc.)',
                    'benchmark_utils.py': 'Benchmarking and performance analysis',
                    'visualization.py': 'Architecture visualization tools'
                }
            },
            'notebooks/': {
                'advanced_architectures.ipynb': 'This comprehensive tutorial',
                'architecture_experiments.ipynb': 'Experimental architectures',
                'performance_analysis.ipynb': 'Detailed performance studies'
            },
            'results/': {
                'advanced_architectures/': 'Generated analysis and visualizations',
                'benchmarks/': 'Performance benchmarking results',
                'models/': 'Saved model checkpoints'
            }
        },
        
        'best_practices': [
            "Use modular design with clear separation of concerns",
            "Implement proper weight initialization for each architecture",
            "Include comprehensive documentation and type hints",
            "Create factory functions for easy model instantiation",
            "Implement flexible configuration systems",
            "Add proper error handling and validation",
            "Use consistent naming conventions across architectures",
            "Include unit tests for critical components",
            "Provide pre-trained model loading capabilities",
            "Document architectural choices and trade-offs"
        ],
        
        'implementation_patterns': {
            'Factory Pattern': 'Use factory functions (resnet18(), densenet121()) for model creation',
            'Builder Pattern': 'Use configuration dicts for complex custom architectures',
            'Strategy Pattern': 'Pluggable components (normalization, activation, attention)',
            'Template Method': 'Base classes with customizable components',
            'Registry Pattern': 'Central registry for architecture variants'
        },
        
        'performance_optimization': [
            "Use in-place activations (inplace=True) where autograd does not need the overwritten tensor",
            "Implement memory-efficient attention mechanisms",
            "Consider gradient checkpointing for very deep networks",
            "Use mixed precision training when available",
            "Optimize batch normalization placement",
            "Implement efficient skip connections",
            "Consider architecture-specific optimizations"
        ]
    }
    
    return guide

# Generate and display code organization guide
org_guide = create_code_organization_guide()

print("📁 Recommended Project Structure:")
print("=" * 60)

def print_structure(structure, indent=0):
    """Recursively print project structure."""
    for key, value in structure.items():
        if isinstance(value, dict):
            print("  " * indent + f"📁 {key}")
            print_structure(value, indent + 1)
        else:
            print("  " * indent + f"📄 {key} - {value}")

print_structure(org_guide['project_structure'])

print(f"\n🎯 Implementation Best Practices:")
for i, practice in enumerate(org_guide['best_practices'], 1):
    print(f"   {i:2d}. {practice}")

print(f"\n🏗️ Design Patterns Used:")
for pattern, description in org_guide['implementation_patterns'].items():
    print(f"   • {pattern}: {description}")

print(f"\n⚡ Performance Optimization Tips:")
for i, tip in enumerate(org_guide['performance_optimization'], 1):
    print(f"   {i}. {tip}")

# Create a sample model factory implementation
print(f"\n🏭 Sample Model Factory Implementation:")
print("=" * 60)

sample_factory_code = '''
# src/architectures/__init__.py
from .resnet import resnet18, resnet34, resnet50, resnet101, resnet152
from .densenet import densenet121, densenet169, densenet201, densenet264
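# Config helpers are assumed to live in custom.py alongside CustomArchitecture
from .custom import CustomArchitecture, create_efficient_net_config, create_hybrid_config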
from .highway import HighwayNetwork

# Model registry for easy access
MODEL_REGISTRY = {
    # ResNet variants
    'resnet18': resnet18,
    'resnet34': resnet34, 
    'resnet50': resnet50,
    'resnet101': resnet101,
    'resnet152': resnet152,
    
    # DenseNet variants
    'densenet121': densenet121,
    'densenet169': densenet169,
    'densenet201': densenet201,
    'densenet264': densenet264,
    
    # Custom architectures
    'efficient_net': lambda **kwargs: CustomArchitecture(create_efficient_net_config(), **kwargs),
    'hybrid_net': lambda **kwargs: CustomArchitecture(create_hybrid_config(), **kwargs),
    
    # Highway networks
    'highway_net': HighwayNetwork,
}

def create_model(model_name: str, num_classes: int = 1000, **kwargs):
    """Factory function to create models by name."""
    if model_name not in MODEL_REGISTRY:
        available = list(MODEL_REGISTRY.keys())
        raise ValueError(f"Unknown model '{model_name}'. Available: {available}")
    
    model_fn = MODEL_REGISTRY[model_name]
    return model_fn(num_classes=num_classes, **kwargs)

def list_models():
    """List all available model architectures."""
    return list(MODEL_REGISTRY.keys())

# Usage examples:
# model = create_model('resnet50', num_classes=10)
# available_models = list_models()
'''

print(sample_factory_code)

print(f"\n📚 Additional Resources and References:")
references = [
    "📖 ResNet Paper: 'Deep Residual Learning for Image Recognition' (He et al., 2016)",
    "📖 DenseNet Paper: 'Densely Connected Convolutional Networks' (Huang et al., 2017)", 
    "📖 Highway Networks: 'Highway Networks' (Srivastava et al., 2015)",
    "📖 Attention: 'Squeeze-and-Excitation Networks' (Hu et al., 2018)",
    "📖 CBAM: 'CBAM: Convolutional Block Attention Module' (Woo et al., 2018)",
    "🔗 PyTorch Model Zoo: torchvision.models for reference implementations",
    "🔗 Papers With Code: https://paperswithcode.com/ for latest architectures",
    "🔗 Distill.pub: https://distill.pub/ for intuitive explanations"
]

for ref in references:
    print(f"   {ref}")
```

### 8.3 Future Directions and Advanced Topics

```python
print("=== 8.3 Future Directions and Advanced Topics ===\n")

future_topics = {
    'transformer_architectures': {
        'title': 'Vision Transformers and Attention-Based Models',
        'topics': [
            'Vision Transformer (ViT) implementation',
            'Swin Transformer for hierarchical vision',
            'DETR for object detection with transformers',
            'Hybrid CNN-Transformer architectures',
            'Efficient attention mechanisms (Linear, Performer)'
        ],
        'difficulty': 'Advanced',
        'prerequisites': 'Attention mechanisms, transformer basics'
    },
    
    'neural_architecture_search': {
        'title': 'Neural Architecture Search (NAS)',
        'topics': [
            'Differentiable architecture search (DARTS)',
            'Evolutionary architecture search',
            'Progressive architecture search',
            'Hardware-aware NAS',
            'Once-for-All networks and efficiency'
        ],
        'difficulty': 'Expert',
        'prerequisites': 'Advanced architectures, optimization theory'
    },
    
    'efficient_architectures': {
        'title': 'Efficient and Mobile Architectures',
        'topics': [
            'MobileNet v1/v2/v3 implementations',
            'EfficientNet family and compound scaling',
            'RegNet and design space exploration',
            'Architecture pruning and quantization',
            'Knowledge distillation techniques'
        ],
        'difficulty': 'Intermediate-Advanced',
        'prerequisites': 'Basic architectures, deployment considerations'
    },
    
    'specialized_architectures': {
        'title': 'Task-Specific Architectures', 
        'topics': [
            'U-Net for semantic segmentation',
            'Feature Pyramid Networks (FPN)',
            'YOLO for object detection',
            'Mask R-CNN for instance segmentation',
            'Siamese networks for metric learning'
        ],
        'difficulty': 'Intermediate-Advanced',
        'prerequisites': 'CNN fundamentals, task-specific knowledge'
    },
    
    'emerging_trends': {
        'title': 'Emerging Architecture Trends',
        'topics': [
            'ConvNeXt: A ConvNet for the 2020s',
            'MetaFormer architectures',
            'Multi-scale and pyramid networks',
            'Graph Neural Networks (GNNs)',
            'Capsule Networks and routing algorithms'
        ],
        'difficulty': 'Advanced-Expert',
        'prerequisites': 'Strong foundation in multiple architectures'
    },
    
    'optimization_techniques': {
        'title': 'Architecture Optimization and Analysis',
        'topics': [
            'Network architecture analysis tools',
            'FLOP counting and memory profiling',
            'Architecture visualization techniques',
            'Interpretability and feature analysis',
            'Robustness and adversarial considerations'
        ],
        'difficulty': 'Intermediate',
        'prerequisites': 'Basic architectures, performance analysis'
    }
}

print("🚀 Advanced Topics for Future Study:")
print("=" * 80)

for category, info in future_topics.items():
    print(f"\n📚 {info['title']}")
    print(f"   🎯 Difficulty: {info['difficulty']}")
    print(f"   📋 Prerequisites: {info['prerequisites']}")
    print(f"   📖 Topics:")
    for i, topic in enumerate(info['topics'], 1):
        print(f"      {i}. {topic}")

# Create a learning roadmap
print(f"\n🗺️ Recommended Learning Roadmap:")
print("=" * 60)

roadmap_stages = [
    {
        'stage': 'Foundation (Completed ✅)',
        'duration': '2-3 weeks',
        'topics': [
            'Skip connections and residual learning',
            'Dense connections and feature reuse',
            'Basic attention mechanisms',
            'Custom architecture design'
        ]
    },
    {
        'stage': 'Intermediate Extensions',
        'duration': '3-4 weeks', 
        'topics': [
            'Efficient architectures (MobileNet, EfficientNet)',
            'Specialized task architectures (U-Net, FPN)',
            'Advanced attention mechanisms',
            'Architecture optimization techniques'
        ]
    },
    {
        'stage': 'Advanced Applications',
        'duration': '4-6 weeks',
        'topics': [
            'Vision Transformers and hybrid models',
            'Neural Architecture Search basics',
            'Multi-modal architectures',
            'Real-world deployment considerations'
        ]
    },
    {
        'stage': 'Expert Level',
        'duration': '6-8 weeks',
        'topics': [
            'Advanced NAS techniques',
            'Novel architecture design',
            'Research-level implementations',
            'Contributing to open source projects'
        ]
    }
]

for i, stage in enumerate(roadmap_stages, 1):
    print(f"\n{i}. {stage['stage']} ({stage['duration']})")
    for topic in stage['topics']:
        print(f"   • {topic}")

# Practical project suggestions
print(f"\n💼 Practical Project Suggestions:")
project_suggestions = [
    {
        'title': 'Custom Architecture for Specific Domain',
        'description': 'Design a custom architecture for your specific use case (medical imaging, satellite imagery, etc.)',
        'skills': ['Architecture design', 'Domain knowledge', 'Performance optimization']
    },
    {
        'title': 'Architecture Comparison Study', 
        'description': 'Comprehensive comparison of architectures on multiple datasets with detailed analysis',
        'skills': ['Benchmarking', 'Statistical analysis', 'Scientific writing']
    },
    {
        'title': 'Efficient Architecture Implementation',
        'description': 'Implement and optimize a mobile-friendly architecture with deployment pipeline',
        'skills': ['Mobile optimization', 'Deployment', 'Performance profiling']
    },
    {
        'title': 'Architecture Visualization Tool',
        'description': 'Build a tool to visualize and analyze different neural architectures',
        'skills': ['Software development', 'Visualization', 'UI/UX design']
    },
    {
        'title': 'Research Replication',
        'description': 'Replicate results from a recent architecture paper and extend the work',
        'skills': ['Research methodology', 'Paper implementation', 'Experimental design']
    }
]

print("=" * 80)
for i, project in enumerate(project_suggestions, 1):
    print(f"\n{i}. {project['title']}")
    print(f"   📝 {project['description']}")
    print(f"   🛠️ Skills: {', '.join(project['skills'])}")

print(f"\n🎓 Congratulations on completing the Advanced Neural Network Architectures module!")
print(f"🌟 You now have a solid foundation in modern deep learning architectures.")
print(f"🚀 Ready to tackle cutting-edge research and real-world applications!")

print(f"\n" + "=" * 80)
print(f"📝 FINAL CHECKLIST - What You Can Now Do:")
checklist_items = [
    "✅ Understand and implement skip connections and residual learning",
    "✅ Build ResNet architectures from scratch with proper initialization",
    "✅ Implement DenseNet with dense connections and feature reuse",
    "✅ Create Highway Networks with learnable gating mechanisms", 
    "✅ Implement modern attention mechanisms (SE, CBAM, Spatial)",
    "✅ Design custom modular architectures for specific tasks",
    "✅ Benchmark and analyze architecture performance systematically",
    "✅ Make informed decisions about architecture selection",
    "✅ Understand trade-offs between accuracy, efficiency, and complexity",
    "✅ Apply best practices in architecture implementation and organization"
]

for item in checklist_items:
    print(f"   {item}")

print(f"=" * 80)
print(f"🎉 Advanced Neural Network Architectures - Complete! 🎉")
```
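
Of the follow-up topics listed above, FLOP counting needs nothing beyond arithmetic for plain convolutions. The sketch below counts weights and multiply-accumulate operations (MACs) for bias-free 2D convolutions, ignoring normalization and activation layers. The helper names and the 56×56 feature map size are illustrative assumptions.

```python
def conv2d_params(c_in: int, c_out: int, kernel_size: int, groups: int = 1) -> int:
    """Weight count of a bias-free conv: each output channel sees c_in/groups inputs."""
    return (c_in // groups) * c_out * kernel_size * kernel_size


def conv2d_macs(c_in: int, c_out: int, kernel_size: int,
                out_h: int, out_w: int, groups: int = 1) -> int:
    """Multiply-accumulates: every weight is used once per output position."""
    return conv2d_params(c_in, c_out, kernel_size, groups) * out_h * out_w


# Basic residual block: two 3x3 convs at 64 channels
basic_block_params = 2 * conv2d_params(64, 64, 3)

# Bottleneck block: 1x1 reduce (256 -> 64), 3x3 (64 -> 64), 1x1 expand (64 -> 256)
bottleneck_params = (
    conv2d_params(256, 64, 1) + conv2d_params(64, 64, 3) + conv2d_params(64, 256, 1)
)

# Depthwise-separable replacement for one 3x3 conv at 64 channels
depthwise_separable_params = (
    conv2d_params(64, 64, 3, groups=64) + conv2d_params(64, 64, 1)
)

standard_conv_macs = conv2d_macs(64, 64, 3, out_h=56, out_w=56)

print(f"basic block:         {basic_block_params:,} weights")
print(f"bottleneck block:    {bottleneck_params:,} weights")
print(f"depthwise separable: {depthwise_separable_params:,} weights")
print(f"3x3 conv @ 56x56:    {standard_conv_macs:,} MACs")
```

The bottleneck block processes four times as many channels as the basic block with slightly fewer convolution weights, which is the usual motivation for using it in deeper ResNet variants.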

---

## Conclusion

This notebook has taken you through the architectural ideas behind modern deep networks. You've worked from the vanishing gradient problem through residual and dense connections to attention mechanisms and a custom architecture framework.

### 🏆 What You've Accomplished

- **Theoretical Foundation**: Deep understanding of gradient flow and the innovations that solved fundamental deep learning challenges
- **Practical Implementation**: Hands-on experience building ResNet, DenseNet, Highway Networks, and custom architectures from scratch
- **Performance Analysis**: Comprehensive benchmarking and comparison frameworks for making informed architectural decisions
- **Design Principles**: Knowledge of modular design patterns and best practices for scalable architecture development

### 🚀 Ready for the Next Level

With this solid foundation in advanced architectures, you're now prepared to tackle:
- Vision Transformers and attention-based models
- Neural Architecture Search and automated design
- Specialized architectures for specific domains
- Cutting-edge research and novel architectural innovations

The journey through neural network architectures is ongoing, with new innovations constantly pushing the boundaries of what's possible. You now have the tools and knowledge to not just understand these advances, but to contribute to them.

**Keep building, keep learning, and keep pushing the boundaries of AI! 🌟**