# DLT Advanced Features Guide

This notebook demonstrates the advanced capabilities of the DLT framework including:
- 🚀 GPU acceleration and mixed precision training
- 📊 Performance profiling and monitoring
- 🎯 Advanced loss functions and smart weighting
- ⚡ Distributed training setup
- 🔧 Custom model architectures

Let's explore these powerful features!

In [1]:
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split

# DLT imports
from dlt.core.config import DLTConfig
from dlt.core.model import DLTModel
from dlt.core.pipeline import train, evaluate
from dlt.utils.performance import GPUManager, PerformanceProfiler
from dlt.utils.loss import SmartLossFunction, CombinedLoss, ClassificationLoss

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")



PyTorch version: 2.8.0+cu128
CUDA available: True
GPU device: NVIDIA RTX A5000


## 🚀 GPU Acceleration & Mixed Precision Training

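With `device: 'auto'`, DLT selects a GPU when one is available. The configuration below also leaves mixed precision and model compilation on `'auto'`, so DLT decides whether to enable them: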

In [2]:
# Create larger dataset for GPU training demonstration
X, y = make_classification(
    n_samples=10000,
    n_features=100,
    n_classes=5,
    n_informative=80,
    n_redundant=0,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Dataset: {X_train.shape[0]} training samples, {X_train.shape[1]} features")
print(f"Classes: {len(np.unique(y))}")

Dataset: 8000 training samples, 100 features
Classes: 5


In [3]:
# Configure GPU training with mixed precision
gpu_config = DLTConfig(
    model_type='torch.nn.Sequential',
    model_params={
        'layers': [
            {'type': 'Linear', 'in_features': 100, 'out_features': 256},
            {'type': 'BatchNorm1d', 'num_features': 256},
            {'type': 'ReLU'},
            {'type': 'Dropout', 'p': 0.3},
            {'type': 'Linear', 'in_features': 256, 'out_features': 128},
            {'type': 'BatchNorm1d', 'num_features': 128},
            {'type': 'ReLU'},
            {'type': 'Dropout', 'p': 0.2},
            {'type': 'Linear', 'in_features': 128, 'out_features': 5}
        ]
    },
    training={
        'epochs': 20,
        'batch_size': 256,
        'optimizer': {'type': 'adamw', 'lr': 0.001, 'weight_decay': 0.01},
        'scheduler': {'type': 'cosine', 'T_max': 20},
        'early_stopping': {'patience': 5}
    },
    hardware={
        'device': 'auto',  # Automatically select GPU if available
        'num_workers': 4,
        'pin_memory': True
    },
    performance={
        'mixed_precision': {'enabled': 'auto'},  # Enable mixed precision if supported
        'compile': {'enabled': 'auto'},  # PyTorch 2.0 compilation
        'memory_optimization': True,
        'profiling': {'enabled': True, 'memory': True, 'compute': True}
    },
    experiment={
        'name': 'gpu_mixed_precision_demo'
    }
)

print("⚡ GPU Configuration:")
print(f"Device: {gpu_config.hardware['device']}")
print(f"Mixed precision: {gpu_config.performance['mixed_precision']['enabled']}")
print(f"Model compilation: {gpu_config.performance['compile']['enabled']}")
print(f"Profiling enabled: {gpu_config.performance['profiling']['enabled']}")

⚡ GPU Configuration:
Device: auto
Mixed precision: auto
Model compilation: auto
Profiling enabled: True
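
Mixed precision in FP16 is usually paired with loss scaling, because very small gradient values round to zero in half precision. The notebook does not show which dtype or scaling strategy DLT picks in `'auto'` mode, so the following is only a minimal sketch of the underlying numeric issue, using the standard library's half-precision packing. The gradient value and scale factor are illustrative assumptions.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8      # illustrative gradient value, not taken from the run above
loss_scale = 65536.0  # 2**16, an illustrative scale factor

# Stored directly in FP16, the value is below the smallest subnormal and becomes 0.
unscaled = to_fp16(tiny_grad)

# Scaling first moves the value into FP16's representable range.
scaled = to_fp16(tiny_grad * loss_scale)

# Unscaling happens in higher precision, recovering the original magnitude.
recovered = scaled / loss_scale

print(f"FP16 without scaling: {unscaled}")
print(f"FP16 with scaling, then unscaled: {recovered:.3e}")
```

The unscaled value is lost entirely, while the scaled one survives the round trip with only a small rounding error.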


In [4]:
# Train with GPU acceleration
print("🚀 Starting GPU training...")

gpu_results = train(
    config=gpu_config,
    train_data=(X_train.astype(np.float32), y_train),
    test_data=(X_test.astype(np.float32), y_test),
    verbose=True
)

print(f"\n🏁 Training completed in {gpu_results['training_time']:.2f} seconds")
print(f"Test accuracy: {gpu_results.get('test_results', {}).get('accuracy', 'N/A')}")

# Show training history
history = gpu_results['training_results'].get('history', {})
if 'train_loss' in history:
    print(f"Final train loss: {history['train_loss'][-1]:.4f}")
if 'val_loss' in history:
    print(f"Final validation loss: {history['val_loss'][-1]:.4f}")

🚀 Starting GPU training...
Starting training with torch.nn.Sequential
Framework: torch
Device: cuda:0
Training torch.nn.Sequential...
Epoch 1/20, Loss: 1.1008
Epoch 5/20, Loss: 0.4381
Epoch 9/20, Loss: 0.2250
Epoch 13/20, Loss: 0.1464
Epoch 17/20, Loss: 0.1045
Training completed in 1.42s
Evaluating on test data...
Test accuracy: 0.9570
Training completed in 1.42 seconds

🏁 Training completed in 1.42 seconds
Test accuracy: 0.957
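
The run above used `'scheduler': {'type': 'cosine', 'T_max': 20}` with a base learning rate of `0.001`. Assuming DLT's `cosine` option follows the closed form of cosine annealing used by PyTorch's `CosineAnnealingLR` (an assumption, since the notebook does not show the mapping), the learning rate per epoch can be sketched as:

```python
import math


def cosine_lr(epoch: int, base_lr: float, t_max: int, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Values from the config above: lr=0.001, T_max=20, eta_min left at 0.
schedule = [cosine_lr(e, base_lr=0.001, t_max=20) for e in range(21)]

for epoch in (0, 5, 10, 15, 20):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.6f}")
```

The rate starts at the base value, reaches half of it at the midpoint (epoch 10), and decays to `eta_min` at `T_max`.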


## 📊 Performance Profiling & Monitoring

Analyze training performance and identify bottlenecks:

In [4]:
# Initialize performance profiler (skip GPUManager due to distributed training conflicts in notebooks)
profiler = PerformanceProfiler(gpu_config.performance)

print("🔍 GPU Information:")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name()}")
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")

print("\n📈 Performance Profiling:")
with profiler.profile_step("data_preparation"):
    # Simulate data preparation work
    dummy_data = torch.randn(1000, 100)
    processed_data = dummy_data * 2 + 1

with profiler.profile_step("model_forward"):
    # Simulate model forward pass
    if torch.cuda.is_available():
        model_demo = torch.nn.Linear(100, 5).cuda()
        output = model_demo(dummy_data.cuda())
    else:
        model_demo = torch.nn.Linear(100, 5)
        output = model_demo(dummy_data)

# Get performance summary
performance_summary = profiler.get_performance_summary()
print(f"Profiling results: {performance_summary}")

# Identify bottlenecks
bottlenecks = profiler.identify_bottlenecks()
if bottlenecks:
    print(f"\n⚠️ Potential bottlenecks:")
    for bottleneck in bottlenecks:
        print(f"  - {bottleneck}")
else:
    print("\n✅ No significant bottlenecks detected")

# Additional GPU memory info
if torch.cuda.is_available():
    print(f"\n🔋 GPU Memory Usage:")
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
    print(f"Memory cached: {torch.cuda.memory_reserved() / 1024**2:.1f} MB")

🔍 GPU Information:
CUDA available: True
GPU count: 4
Current device: 0
Device name: NVIDIA RTX A5000
Memory allocated: 0.0 MB

📈 Performance Profiling:
Profiling results: {'step_timings': {'data_preparation': {'avg_ms': np.float64(193.0524290073663), 'min_ms': np.float64(193.0524290073663), 'max_ms': np.float64(193.0524290073663), 'std_ms': np.float64(0.0), 'count': 1}, 'model_forward': {'avg_ms': np.float64(84.02242185547948), 'min_ms': np.float64(84.02242185547948), 'max_ms': np.float64(84.02242185547948), 'std_ms': np.float64(0.0), 'count': 1}}, 'memory_usage': {'peak_memory_mb': np.float64(8.52880859375), 'avg_memory_mb': np.float64(4.264404296875), 'avg_memory_delta_mb': np.float64(4.264404296875), 'max_memory_delta_mb': np.float64(8.52880859375)}}

⚠️ Potential bottlenecks:
  - Operation 'data_preparation' is taking 69.7% of time
  - Operation 'model_forward' is taking 30.3% of time

🔋 GPU Memory Usage:
Memory allocated: 8.5 MB
Memory cached: 22.0 MB
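
The percentages in the bottleneck report are consistent with each step's share of the summed average time. The sketch below recomputes them from the `avg_ms` values printed above; that `identify_bottlenecks()` works this way internally is an assumption.

```python
# avg_ms values copied from the profiling summary above.
step_avg_ms = {
    "data_preparation": 193.0524290073663,
    "model_forward": 84.02242185547948,
}

total_ms = sum(step_avg_ms.values())

# Share of total profiled time per step, in percent.
shares = {name: 100.0 * ms / total_ms for name, ms in step_avg_ms.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of profiled time")
```

Because each step was timed once (`count: 1`), these shares rest on a single measurement; repeating the steps would give the profiler a mean and spread to work with.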

## 🎯 Advanced Loss Functions

DLT supports focal loss, label smoothing, and weighted combinations of loss components, which is useful for imbalanced data:

In [5]:
# Create imbalanced dataset for loss function demo
X_imbalanced, y_imbalanced = make_classification(
    n_samples=5000,
    n_features=50,
    n_classes=3,
    n_informative=40,
    n_redundant=0,
    weights=[0.7, 0.2, 0.1],  # Imbalanced classes
    random_state=42
)

# Check class distribution
unique, counts = np.unique(y_imbalanced, return_counts=True)
class_distribution = dict(zip(unique, counts))
print(f"Class distribution: {class_distribution}")
print(f"Class imbalance ratio: {max(counts)/min(counts):.2f}")

Class distribution: {np.int64(0): np.int64(3478), np.int64(1): np.int64(1011), np.int64(2): np.int64(511)}
Class imbalance ratio: 6.81
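
One common way to turn these counts into per-class weights is the inverse-frequency ("balanced") heuristic, `n_samples / (n_classes * count)`, the same formula scikit-learn uses for `class_weight='balanced'`. This is shown only as a point of comparison; the `alpha` values used later in this notebook are set by hand, not derived this way.

```python
# Class counts copied from the output above.
counts = {0: 3478, 1: 1011, 2: 511}

n_samples = sum(counts.values())
n_classes = len(counts)

# Ratio between the largest and smallest class, as printed by the notebook.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency weights: rarer classes receive larger weights.
balanced = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(f"Imbalance ratio: {imbalance_ratio:.2f}")
for c, w in balanced.items():
    print(f"class {c}: weight = {w:.4f}")
```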


In [6]:
# Configure advanced loss functions
advanced_loss_config = DLTConfig(
    model_type='torch.nn.Sequential',
    model_params={
        'layers': [
            {'type': 'Linear', 'in_features': 50, 'out_features': 128},
            {'type': 'ReLU'},
            {'type': 'Dropout', 'p': 0.3},
            {'type': 'Linear', 'in_features': 128, 'out_features': 64},
            {'type': 'ReLU'},
            {'type': 'Linear', 'in_features': 64, 'out_features': 3}
        ]
    },
    training={
        'epochs': 15,
        'batch_size': 128,
        'optimizer': {'type': 'adam', 'lr': 0.001},
        'loss': {
            'type': 'focal',  # Focal loss for imbalanced data
            'alpha': [0.25, 0.5, 0.8],  # Class-specific weights
            'gamma': 2.0,  # Focal loss gamma parameter
            'adaptive_weighting': True,  # Dynamic loss reweighting
            'label_smoothing': 0.1  # Label smoothing
        }
    },
    experiment={
        'name': 'advanced_loss_demo'
    }
)

print("🎯 Advanced Loss Configuration:")
print(f"Loss type: {advanced_loss_config.training['loss']['type']}")
print(f"Focal gamma: {advanced_loss_config.training['loss']['gamma']}")
print(f"Class weights: {advanced_loss_config.training['loss']['alpha']}")
print(f"Adaptive weighting: {advanced_loss_config.training['loss']['adaptive_weighting']}")
print(f"Label smoothing: {advanced_loss_config.training['loss']['label_smoothing']}")

🎯 Advanced Loss Configuration:
Loss type: focal
Focal gamma: 2.0
Class weights: [0.25, 0.5, 0.8]
Adaptive weighting: True
Label smoothing: 0.1
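
To see what `gamma` does, recall the standard focal loss definition, `FL = -alpha_t * (1 - p_t)**gamma * log(p_t)`, where `p_t` is the predicted probability of the true class. The sketch below applies that textbook formula to a few illustrative probabilities; how DLT combines it with `adaptive_weighting` and `label_smoothing` is not shown in this notebook.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Plain cross-entropy for the true-class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t: float, alpha_t: float = 1.0, gamma: float = 2.0) -> float:
    """Textbook focal loss: cross-entropy scaled by alpha_t * (1 - p_t)**gamma."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative probabilities: an easy, a borderline, and a hard example.
for p_t in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t:.1f}  CE={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.2f}")
```

With `gamma=2.0`, a well-classified example (`p_t=0.9`) keeps about 1% of its cross-entropy loss, while a hard example (`p_t=0.1`) keeps about 81%, so training focuses on the examples the model gets wrong.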


In [7]:
# Demonstrate combined loss functions
combined_loss_config = {
    'components': [
        {
            'category': 'classification',
            'type': 'cross_entropy',  # Use cross_entropy type with focal=True
            'weight': 0.7,
            'focal': True,
            'focal_gamma': 2.0,
            'focal_alpha': 0.25
        },
        {
            'category': 'classification', 
            'type': 'cross_entropy',
            'weight': 0.3,
            'label_smoothing': 0.1
        }
    ],
    'adaptive_weighting': True
}

combined_loss = CombinedLoss(combined_loss_config)
print(f"\n🔄 Combined Loss Functions:")
print(f"Number of components: {len(combined_loss.loss_functions)}")
print(f"Component names: {combined_loss.loss_names}")

# Test combined loss
dummy_predictions = torch.randn(32, 3)
dummy_targets = torch.randint(0, 3, (32,))

loss_value = combined_loss(dummy_predictions, dummy_targets)
print(f"Sample combined loss value: {loss_value.item():.4f}")


🔄 Combined Loss Functions:
Number of components: 2
Component names: ['loss_1', 'loss_2']
Sample combined loss value: 0.5045
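
The second component above sets `label_smoothing: 0.1`. In the usual formulation (the one PyTorch's `CrossEntropyLoss` uses), the one-hot target is replaced by `(1 - eps) * one_hot + eps / K` for `K` classes. A minimal pure-Python sketch with illustrative logits follows; whether `CombinedLoss` delegates to PyTorch for this is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_targets(target: int, n_classes: int, smoothing: float):
    """Mix the one-hot target with a uniform distribution over classes."""
    return [
        (1.0 - smoothing) * (1.0 if i == target else 0.0) + smoothing / n_classes
        for i in range(n_classes)
    ]


def smoothed_cross_entropy(logits, target: int, smoothing: float = 0.1) -> float:
    """Cross-entropy against the smoothed target distribution."""
    probs = softmax(logits)
    q = smoothed_targets(target, len(logits), smoothing)
    return -sum(qi * math.log(pi) for qi, pi in zip(q, probs))


logits = [4.0, 0.5, -1.0]  # illustrative logits for a confident, correct prediction
print(f"Smoothed targets: {smoothed_targets(0, 3, 0.1)}")
print(f"CE without smoothing: {smoothed_cross_entropy(logits, 0, smoothing=0.0):.4f}")
print(f"CE with smoothing 0.1: {smoothed_cross_entropy(logits, 0, smoothing=0.1):.4f}")
```

Smoothing raises the loss on very confident predictions, which discourages the model from pushing logits to extremes.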


## ⚡ Distributed Training Configuration

Configure multi-GPU distributed training:

In [8]:
# Configure distributed training
distributed_config = DLTConfig(
    model_type='torch.nn.Sequential',
    model_params={
        'layers': [
            {'type': 'Linear', 'in_features': 100, 'out_features': 512},
            {'type': 'BatchNorm1d', 'num_features': 512},
            {'type': 'ReLU'},
            {'type': 'Dropout', 'p': 0.2},
            {'type': 'Linear', 'in_features': 512, 'out_features': 256},
            {'type': 'BatchNorm1d', 'num_features': 256},
            {'type': 'ReLU'},
            {'type': 'Linear', 'in_features': 256, 'out_features': 5}
        ]
    },
    training={
        'epochs': 10,
        'batch_size': 512,  # Larger batch size for distributed training
        'optimizer': {'type': 'sgd', 'lr': 0.01, 'momentum': 0.9, 'weight_decay': 1e-4}
    },
    hardware={
        'device': 'auto',
        'gpu_ids': None,  # Use all available GPUs
        'distributed': {
            'enabled': False,  # Disable distributed for notebook demo to prevent hanging
            'backend': 'nccl',  # NCCL for GPU communication
            'find_unused_parameters': False,
            'static_graph': True  # Optimize for static computation graph
        }
    },
    performance={
        'mixed_precision': {'enabled': True},
        'compile': {'enabled': True, 'mode': 'max-autotune'},
        'memory_optimization': True
    },
    experiment={
        'name': 'distributed_training_demo'
    }
)

print("⚡ Distributed Training Configuration:")
print(f"Distributed backend: {distributed_config.hardware['distributed']['backend']}")
print(f"Distributed enabled: {distributed_config.hardware['distributed']['enabled']}")
print(f"Mixed precision: {distributed_config.performance['mixed_precision']['enabled']}")
print(f"Compilation mode: {distributed_config.performance['compile']['mode']}")
print(f"Batch size: {distributed_config.training['batch_size']}")

# Show hardware information without initializing distributed GPUManager
print(f"\n🖥️ Hardware Setup:")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Available GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
print(f"Distributed training: {'Enabled' if distributed_config.hardware['distributed']['enabled'] else 'Disabled (for notebook compatibility)'}")

# Note about distributed training in production
print(f"\n📝 Note: In production environments with multiple GPUs, set:")
print(f"  distributed.enabled = 'auto' or True")
print(f"  Use proper process launching (torchrun, mpirun, etc.)")
print(f"  Current config is safe for notebook demonstration")

⚡ Distributed Training Configuration:
Distributed backend: nccl
Distributed enabled: False
Mixed precision: True
Compilation mode: max-autotune
Batch size: 512

🖥️ Hardware Setup:
CUDA available: True
Available GPUs: 4
  GPU 0: NVIDIA RTX A5000
  GPU 1: NVIDIA RTX A5000
  GPU 2: NVIDIA RTX A5000
  GPU 3: NVIDIA RTX A5000
Distributed training: Disabled (for notebook compatibility)

📝 Note: In production environments with multiple GPUs, set:
  distributed.enabled = 'auto' or True
  Use proper process launching (torchrun, mpirun, etc.)
  Current config is safe for notebook demonstration
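
When a script is started with `torchrun`, each worker process receives `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` as environment variables. The sketch below reads them and derives the global batch size, assuming `batch_size` is interpreted per process (the notebook does not say how DLT treats it). The helper name `distributed_context` is hypothetical and not part of DLT.

```python
import os


def distributed_context(per_process_batch_size: int) -> dict:
    """Read torchrun-style environment variables, defaulting to a single process."""
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return {
        "rank": rank,                # global index of this process
        "local_rank": local_rank,    # index of this process on its node (GPU to use)
        "world_size": world_size,    # total number of processes
        # Assumption: each process consumes its own batch every step.
        "global_batch_size": per_process_batch_size * world_size,
    }


# In a notebook, none of the variables are set, so this reports a single process.
print(distributed_context(512))
```

On the four-GPU machine shown above, a launch such as `torchrun --nproc_per_node=4 train.py` (with `train.py` standing in for your own script) would give `world_size=4` and, under the per-process assumption, a global batch size of 2048.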


## 🔧 Custom Model Architectures

Define a deeper network declaratively through the `layers` list in `model_params`:

In [9]:
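# Define a deep MLP from repeated Linear/BatchNorm/ReLU blocks.
# Note: "residual-style" refers to the block layout only; nn.Sequential adds no skip connections.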
resnet_style_config = DLTConfig(
    model_type='torch.nn.Sequential',
    model_params={
        'layers': [
            # Input layer
            {'type': 'Linear', 'in_features': 100, 'out_features': 256},
            {'type': 'BatchNorm1d', 'num_features': 256},
            {'type': 'ReLU'},
            
            # First residual-style block
            {'type': 'Linear', 'in_features': 256, 'out_features': 256},
            {'type': 'BatchNorm1d', 'num_features': 256},
            {'type': 'ReLU'},
            {'type': 'Dropout', 'p': 0.1},
            
            {'type': 'Linear', 'in_features': 256, 'out_features': 256},
            {'type': 'BatchNorm1d', 'num_features': 256},
            {'type': 'ReLU'},
            
            # Second block with dimension reduction
            {'type': 'Linear', 'in_features': 256, 'out_features': 128},
            {'type': 'BatchNorm1d', 'num_features': 128},
            {'type': 'ReLU'},
            {'type': 'Dropout', 'p': 0.2},
            
            {'type': 'Linear', 'in_features': 128, 'out_features': 128},
            {'type': 'BatchNorm1d', 'num_features': 128},
            {'type': 'ReLU'},
            
            # Output layer
            {'type': 'Linear', 'in_features': 128, 'out_features': 5}
        ]
    },
    training={
        'epochs': 25,
        'batch_size': 128,
        'optimizer': {
            'type': 'adamw',
            'lr': 0.001,
            'weight_decay': 0.01,
            'betas': [0.9, 0.999],
            'eps': 1e-8
        },
        'scheduler': {
            'type': 'cosine',
            'T_max': 25,
            'eta_min': 1e-6
        },
        'loss': {
            'type': 'cross_entropy',
            'label_smoothing': 0.1
        },
        'early_stopping': {
            'patience': 8,
            'min_delta': 1e-4
        },
        'gradient_clipping': {
            'enabled': True,
            'max_norm': 1.0
        }
    },
    experiment={
        'name': 'custom_architecture_demo',
        'tags': ['custom', 'resnet_style', 'deep']
    }
)

print("🔧 Custom Architecture Configuration:")
print(f"Total layers: {len(resnet_style_config.model_params['layers'])}")
print(f"Optimizer: {resnet_style_config.training['optimizer']['type']}")
print(f"Learning rate scheduler: {resnet_style_config.training['scheduler']['type']}")
print(f"Gradient clipping: {resnet_style_config.training['gradient_clipping']['enabled']}")
print(f"Label smoothing: {resnet_style_config.training['loss']['label_smoothing']}")

# Create and inspect the model
custom_model = DLTModel.from_config(resnet_style_config)
model_info = custom_model.get_model_info()

print(f"\n📋 Model Information:")
print(f"Framework: {model_info['framework']}")
print(f"Model type: {model_info['model_type']}")
print(f"Device: {model_info.get('device', 'Unknown')}")

# Count parameters
if hasattr(custom_model._model, 'parameters'):
    total_params = sum(p.numel() for p in custom_model._model.parameters())
    trainable_params = sum(p.numel() for p in custom_model._model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")

🔧 Custom Architecture Configuration:
Total layers: 18
Optimizer: adamw
Learning rate scheduler: cosine
Gradient clipping: True
Label smoothing: 0.1

📋 Model Information:
Framework: torch
Model type: torch.nn.Sequential
Device: cuda
Total parameters: 209,541
Trainable parameters: 209,541
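
The parameter total can be checked by hand from the layer list: a `Linear` layer contributes `in_features * out_features + out_features` (weights plus bias), a `BatchNorm1d` layer contributes `2 * num_features` (scale and shift; its running statistics are buffers, not parameters), and `ReLU` and `Dropout` contribute none. The helper names below are illustrative, not part of DLT.

```python
def block(n_in: int, n_out: int):
    """Linear -> BatchNorm1d -> ReLU, in the same dict format as the config above."""
    return [
        {"type": "Linear", "in_features": n_in, "out_features": n_out},
        {"type": "BatchNorm1d", "num_features": n_out},
        {"type": "ReLU"},
    ]


def count_params(layers) -> int:
    """Count learnable parameters for the layer types used in this notebook."""
    total = 0
    for layer in layers:
        if layer["type"] == "Linear":
            total += layer["in_features"] * layer["out_features"] + layer["out_features"]
        elif layer["type"] == "BatchNorm1d":
            total += 2 * layer["num_features"]  # weight and bias
        # ReLU and Dropout have no learnable parameters.
    return total


# Same layer sequence as resnet_style_config above.
layers = (
    block(100, 256)
    + block(256, 256) + [{"type": "Dropout", "p": 0.1}]
    + block(256, 256)
    + block(256, 128) + [{"type": "Dropout", "p": 0.2}]
    + block(128, 128)
    + [{"type": "Linear", "in_features": 128, "out_features": 5}]
)

print(f"Total layers: {len(layers)}")
print(f"Total parameters: {count_params(layers):,}")
```

This reproduces the 18 layers and 209,541 parameters reported above.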


## 📈 Training with Advanced Monitoring

Train the custom model with comprehensive monitoring:

In [10]:
# Train the custom architecture
print("🚀 Training custom architecture...")

custom_results = train(
    config=resnet_style_config,
    train_data=(X_train.astype(np.float32), y_train),
    test_data=(X_test.astype(np.float32), y_test),
    verbose=True
)

print(f"\n🏁 Custom Architecture Results:")
print(f"Training time: {custom_results['training_time']:.2f} seconds")
print(f"Test accuracy: {custom_results.get('test_results', {}).get('accuracy', 'N/A')}")

# Analyze training history
history = custom_results['training_results'].get('history', {})
if history:
    print(f"\n📊 Training History:")
    for metric, values in history.items():
        if isinstance(values, list) and len(values) > 0:
            print(f"  {metric}: {values[-1]:.4f} (final)")
            if len(values) > 1:
                # Signed change from the first to the last epoch (negative is better for losses)
                change = values[-1] - values[0]
                print(f"    Change since first epoch: {change:+.4f}")

# Model performance summary
test_results = custom_results.get('test_results', {})
if test_results:
    print(f"\n🎯 Test Performance:")
    for metric, value in test_results.items():
        if isinstance(value, (int, float)):
            print(f"  {metric}: {value:.4f}")

🚀 Training custom architecture...
Starting training with torch.nn.Sequential
Framework: torch
Device: cuda:0
Training torch.nn.Sequential...
Epoch 1/25, Loss: 0.6146
Epoch 6/25, Loss: 0.0539
Epoch 11/25, Loss: 0.0309
Epoch 16/25, Loss: 0.0180
Epoch 21/25, Loss: 0.0117
Training completed in 5.39s
Evaluating on test data...
Test accuracy: 0.9230
Training completed in 5.39 seconds

🏁 Custom Architecture Results:
Training time: 5.39 seconds
Test accuracy: 0.923

🎯 Test Performance:
  accuracy: 0.9230
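
The configuration for this run sets `early_stopping` with `patience: 8` and `min_delta: 1e-4`. A typical reading of those two parameters is: stop once the monitored loss has failed to improve by at least `min_delta` for `patience` consecutive epochs. The class below is a minimal sketch of that rule, not DLT's implementation, and the loss values are made up for illustration.

```python
class EarlyStopping:
    """Stop when the loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience: int = 8, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation losses (not from the run above): improvement stalls after epoch 3.
losses = [0.60, 0.40, 0.30, 0.30, 0.31, 0.30, 0.32]

stopper = EarlyStopping(patience=3, min_delta=1e-4)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch: {stopped_at}, best loss: {stopper.best:.2f}")
```

A larger `patience` tolerates longer plateaus, while a larger `min_delta` ignores improvements too small to matter.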


## 🎯 Summary

In this advanced features guide, you've explored:

✅ **GPU Acceleration** - Automatic GPU detection and mixed precision training  
✅ **Performance Profiling** - Bottleneck detection and optimization insights  
✅ **Advanced Loss Functions** - Focal loss, combined losses, and smart weighting  
✅ **Distributed Training** - Multi-GPU setup and configuration  
✅ **Custom Architectures** - Complex model designs with advanced optimizers  

### Key Takeaways:

- 🚀 DLT automatically handles GPU optimization and mixed precision
- 📊 Built-in profiling helps identify and resolve performance bottlenecks
- 🎯 Advanced loss functions handle imbalanced data and complex objectives
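- ⚡ Distributed training is configured under `hardware.distributed`; it was left disabled in this notebook and needs a process launcher such as `torchrun` to run across multiple GPUs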
- 🔧 Flexible architecture definition supports complex models

### Next Steps:

- Explore `03_multi_framework_comparison.ipynb` for framework comparisons
- Check `04_production_deployment.ipynb` for deployment strategies
- Review the test suite in `tests/` for comprehensive usage examples

### 💡 Pro Tips:

1. **GPU Memory**: Use `memory_optimization: True` for large models
2. **Mixed Precision**: Enable `auto` mode for automatic FP16 optimization
3. **Profiling**: Enable profiling during development, disable in production
4. **Loss Functions**: Use focal loss for imbalanced datasets
5. **Distributed**: Increase batch size proportionally to GPU count

Happy advanced training! 🚀✨