# Lambda Cloud Tutorial

This tutorial demonstrates how to use Clustrix with Lambda Cloud for high-performance GPU computing and distributed machine learning.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/source/notebooks/lambda_cloud_tutorial.ipynb)

## Overview

Lambda Cloud specializes in GPU cloud computing and integrates well with Clustrix for ML workloads:

- **GPU-Optimized Instances**: High-performance NVIDIA GPUs (A100, H100, RTX)
- **Cost-Effective**: Competitive pricing for GPU computing
- **Simple Management**: Easy instance launching and management
- **Pre-configured Environments**: ML-ready software stacks
- **High-Speed Networking**: InfiniBand for multi-GPU communications
- **Persistent Storage**: Fast NVMe and network storage options
- **SSH Access**: Direct access for Clustrix integration
- **On-Demand and Reserved**: Flexible pricing models

## Prerequisites

1. Lambda Cloud account with GPU credits
2. SSH key pair for instance access
3. Lambda Cloud API key (optional; needed only for the API-based setup in Option 2)
4. Basic understanding of GPU computing

## Installation and Setup

Install Clustrix with Lambda Cloud dependencies:

In [None]:
# Install Clustrix with GPU and Lambda Cloud support
!pip install clustrix torch torchvision transformers datasets accelerate

# Import required libraries
import clustrix
from clustrix import cluster, configure
import torch
import numpy as np
import time
import json
import requests
import os

## Lambda Cloud Authentication and Setup

### Option 1: Web Console Setup

1. **Create Account:**
   - Visit https://lambdalabs.com/service/gpu-cloud
   - Sign up and verify your account
   - Add billing information and credits

2. **Add SSH Key:**
   - Go to https://cloud.lambdalabs.com/ssh-keys
   - Click "Add SSH Key"
   - Paste your public key (`cat ~/.ssh/id_rsa.pub`)
   - Give it a descriptive name

3. **Launch Instance:**
   - Go to https://cloud.lambdalabs.com/instances
   - Click "Launch instance"
   - Select instance type (A100, H100, RTX 6000 Ada, etc.)
   - Choose region (closest to you for best performance)
   - Select your SSH key
   - Launch the instance

4. **Instance Types Available:**
   - RTX 6000 Ada: 48GB VRAM, ~$0.75/hour
   - A10: 24GB VRAM, ~$0.60/hour  
   - A100 (40GB): 40GB VRAM, ~$1.10/hour
   - A100 (80GB): 80GB VRAM, ~$1.40/hour
   - H100: 80GB VRAM, ~$2.50/hour (when available)

5. **Access Instance:**
   - Wait for instance to be "Running"
   - Note the public IP address
   - SSH: `ssh ubuntu@<PUBLIC_IP>`

**Follow the steps above to set up your Lambda Cloud account and launch your first GPU instance.**
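
Before launching, it can help to estimate what a session will cost. The sketch below multiplies the approximate hourly rates listed above by the planned runtime. The rates are copied from this tutorial and will drift over time, and `estimate_cost` is a hypothetical helper, not part of Clustrix or the Lambda Cloud API.

```python
# Approximate hourly rates (USD) copied from the list above; check current pricing.
APPROX_HOURLY_RATE_USD = {
    "RTX 6000 Ada": 0.75,
    "A10": 0.60,
    "A100 (40GB)": 1.10,
    "A100 (80GB)": 1.40,
    "H100": 2.50,
}


def estimate_cost(instance_type, hours, n_instances=1):
    """Return the estimated cost in USD of running `n_instances` for `hours`."""
    rate = APPROX_HOURLY_RATE_USD[instance_type]
    return round(rate * hours * n_instances, 2)


# An 8-hour session on one A100 (40GB), and 2.5 hours on two H100 instances
print(estimate_cost("A100 (40GB)", hours=8))
print(estimate_cost("H100", hours=2.5, n_instances=2))
```

Billing continues until an instance is terminated, so the runtime you plug in should cover idle time as well as training time.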

### Option 2: API-Based Setup

In [None]:
import requests
import os

class LambdaCloudAPI:
    """Thin wrapper around the Lambda Cloud REST API."""

    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv('LAMBDA_API_KEY')
        if not self.api_key:
            raise ValueError("No API key: pass api_key or set LAMBDA_API_KEY")
        self.base_url = 'https://cloud.lambdalabs.com/api/v1'
        self.headers = {
            'Authorization': f'Bearer {self.api_key}',
            'Content-Type': 'application/json'
        }

    def _request(self, method, path, **kwargs):
        """Send a request and return the decoded JSON, raising on HTTP errors."""
        response = requests.request(method, f'{self.base_url}{path}',
                                    headers=self.headers, timeout=30, **kwargs)
        response.raise_for_status()
        return response.json()

    def list_instance_types(self):
        """List available instance types."""
        return self._request('GET', '/instance-types')

    def list_instances(self):
        """List running instances."""
        return self._request('GET', '/instances')

    def launch_instance(self, instance_type, ssh_key_name, region='us-east-1', name=None):
        """Launch a new instance."""
        data = {
            'instance_type_name': instance_type,
            'ssh_key_names': [ssh_key_name],
            'region_name': region
        }
        if name:
            data['name'] = name
        return self._request('POST', '/instance-operations/launch', json=data)

    def terminate_instance(self, instance_id):
        """Terminate an instance."""
        data = {'instance_ids': [instance_id]}
        return self._request('POST', '/instance-operations/terminate', json=data)

    def get_instance_details(self, instance_id):
        """Get detailed information about an instance, or None if it is not found."""
        instances = self.list_instances()
        for instance in instances.get('data', []):
            if instance['id'] == instance_id:
                return instance
        return None

# Example usage:
# api = LambdaCloudAPI()
# instance_types = api.list_instance_types()
# print(json.dumps(instance_types, indent=2))
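
GPU capacity varies by region, so it is useful to filter the instance-type listing down to types that can be launched right now. The sketch below assumes the response is a dict whose `data` field maps each instance type name to an entry containing a `regions_with_capacity_available` list. That layout is an assumption to verify against the API documentation, and `types_with_capacity` and the sample payload are hypothetical.

```python
def types_with_capacity(payload):
    """Return {instance_type_name: [region_name, ...]} for launchable types."""
    available = {}
    for name, entry in payload.get("data", {}).items():
        regions = [r["name"] for r in entry.get("regions_with_capacity_available", [])]
        if regions:
            available[name] = regions
    return available


# Hypothetical payload for illustration only (not real API output)
sample = {
    "data": {
        "gpu_1x_a100": {"regions_with_capacity_available": [{"name": "us-east-1"}]},
        "gpu_1x_h100": {"regions_with_capacity_available": []},
    }
}
print(types_with_capacity(sample))

# With a real key, the same helper could be fed the live listing:
# print(types_with_capacity(LambdaCloudAPI().list_instance_types()))
```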

### Lambda Cloud API Setup Guide

#### CLI Setup Steps

1. **Get API Key:**
   - Go to https://cloud.lambdalabs.com/api-keys
   - Generate a new API key
   - Set as environment variable: `export LAMBDA_API_KEY="your-key"`

2. **Install Lambda Cloud CLI:**
   ```bash
   pip install lambda-cloud
   lambda-cloud configure  # Enter your API key
   ```

3. **Basic CLI Commands:**
   ```bash
   # List available instance types
   lambda-cloud instance-types list
   
   # List available regions
   lambda-cloud regions list
   
   # Launch instance
   lambda-cloud instance launch \
     --instance-type a100 \
     --ssh-key-name your-key-name \
     --region us-east-1
   
   # List running instances
   lambda-cloud instance list
   
   # Terminate instance
   lambda-cloud instance terminate <INSTANCE_ID>
   ```

## Configure Clustrix for Lambda Cloud

In [None]:
# Configure Clustrix to use your Lambda Cloud instance
configure(
    cluster_type="ssh",
    cluster_host="your-lambda-instance-ip",  # Replace with actual IP
    username="ubuntu",  # Default Lambda Cloud user
    key_file="~/.ssh/id_rsa",  # Your private SSH key
    remote_work_dir="/tmp/clustrix",
    package_manager="auto",  # Will use uv if available
    default_cores=8,  # Lambda instances typically have 8+ cores
    default_memory="32GB",  # Generous memory allocation
    default_time="02:00:00",  # Longer timeout for GPU tasks
    environment_variables={
        "CUDA_VISIBLE_DEVICES": "0",  # Use first GPU
        "NVIDIA_VISIBLE_DEVICES": "all"
    }
)

**Replace `your-lambda-instance-ip` with the actual IP address from your Lambda Cloud instance.**
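
A freshly launched instance may report "Running" a little before its SSH daemon accepts connections. The sketch below polls the SSH port until it opens, so you can delay the first remote call until then. `wait_for_ssh` is a hypothetical helper built on the standard library. It only confirms that something is listening on the port, not that your key is accepted.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=300.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the port is open; close it right away
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline
            if time.monotonic() + interval > deadline:
                return False
            time.sleep(interval)


# Example (replace the placeholder with your instance's public IP):
# if wait_for_ssh("your-lambda-instance-ip"):
#     print("Instance is reachable over SSH")
```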

### GPU Verification and Setup

In [None]:
@cluster(cores=2, memory="8GB")
def verify_lambda_gpu_setup():
    """Verify GPU availability and setup on Lambda Cloud instance."""
    import torch
    import subprocess
    import platform
    
    # System information
    system_info = {
        'platform': platform.platform(),
        'python_version': platform.python_version(),
        'architecture': platform.architecture()[0]
    }
    
    # PyTorch and CUDA info
    torch_info = {
        'pytorch_version': torch.__version__,
        'cuda_available': torch.cuda.is_available(),
        'cuda_version': torch.version.cuda if torch.cuda.is_available() else None,
        'cudnn_version': torch.backends.cudnn.version() if torch.cuda.is_available() else None,
        'device_count': torch.cuda.device_count() if torch.cuda.is_available() else 0
    }
    
    # GPU details
    gpu_info = []
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            gpu_info.append({
                'device_id': i,
                'name': props.name,
                'total_memory_gb': props.total_memory / (1024**3),
                'major': props.major,
                'minor': props.minor,
                'multiprocessor_count': props.multi_processor_count
            })
    
    # NVIDIA-SMI output
    nvidia_smi = None
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            nvidia_smi = result.stdout
    except FileNotFoundError:
        nvidia_smi = "nvidia-smi not found"
    
    # Test GPU computation
    gpu_test_result = None
    if torch.cuda.is_available():
        try:
            # Simple GPU computation test
            device = torch.device('cuda:0')
            x = torch.randn(1000, 1000, device=device)
            y = torch.randn(1000, 1000, device=device)
            
            start_time = torch.cuda.Event(enable_timing=True)
            end_time = torch.cuda.Event(enable_timing=True)
            
            start_time.record()
            z = torch.mm(x, y)
            end_time.record()
            # Wait for the queued GPU work so elapsed_time() measures only the matmul
            torch.cuda.synchronize()
            
            gpu_test_result = {
                'test_passed': True,
                'computation_time_ms': start_time.elapsed_time(end_time),
                'result_shape': z.shape,
                'memory_allocated_mb': torch.cuda.memory_allocated() / (1024**2),
                'memory_reserved_mb': torch.cuda.memory_reserved() / (1024**2)
            }
        except Exception as e:
            gpu_test_result = {
                'test_passed': False,
                'error': str(e)
            }
    
    return {
        'system_info': system_info,
        'torch_info': torch_info,
        'gpu_info': gpu_info,
        'nvidia_smi': nvidia_smi,
        'gpu_test': gpu_test_result
    }

# Run GPU verification
# gpu_status = verify_lambda_gpu_setup()
# print(json.dumps(gpu_status, indent=2, default=str))
print("GPU verification function defined. Uncomment the lines above to run on Lambda Cloud.")
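
The dictionary returned by `verify_lambda_gpu_setup()` is fairly verbose once the `nvidia-smi` text is included. As a rough sketch, the hypothetical helper below condenses it into a few readable lines. It relies only on the keys that the function above returns.

```python
def summarize_gpu_status(status):
    """Return short human-readable lines from the verification dictionary."""
    torch_info = status.get("torch_info", {})
    lines = [
        f"PyTorch {torch_info.get('pytorch_version')} | "
        f"CUDA available: {torch_info.get('cuda_available')}"
    ]
    for gpu in status.get("gpu_info", []):
        lines.append(
            f"GPU {gpu['device_id']}: {gpu['name']} "
            f"({gpu['total_memory_gb']:.1f} GB, compute {gpu['major']}.{gpu['minor']})"
        )
    test = status.get("gpu_test") or {}
    if test.get("test_passed"):
        lines.append(f"Matmul test passed in {test['computation_time_ms']:.2f} ms")
    elif test:
        lines.append(f"Matmul test failed: {test.get('error')}")
    else:
        lines.append("Matmul test skipped (no CUDA device)")
    return lines


# Example, once the verification has run on the instance:
# for line in summarize_gpu_status(gpu_status):
#     print(line)
```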

## Example 1: Distributed Deep Learning Training

In [None]:
@cluster(cores=8, memory="16GB", time="01:30:00")
def lambda_deep_learning_training(model_config, training_config):
    """Train a deep learning model on Lambda Cloud GPU."""
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, TensorDataset
    import numpy as np
    import time
    
    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training on device: {device}")
    
    # Create synthetic dataset
    n_samples = training_config['n_samples']
    n_features = training_config['n_features']
    n_classes = training_config['n_classes']
    
    # Generate random data
    X = torch.randn(n_samples, n_features)
    y = torch.randint(0, n_classes, (n_samples,))
    
    # Create dataset and dataloader
    dataset = TensorDataset(X, y)
    dataloader = DataLoader(
        dataset, 
        batch_size=training_config['batch_size'], 
        shuffle=True
    )
    
    # Define model architecture
    class DeepNet(nn.Module):
        def __init__(self, input_size, hidden_sizes, output_size, dropout=0.2):
            super(DeepNet, self).__init__()
            
            layers = []
            prev_size = input_size
            
            for hidden_size in hidden_sizes:
                layers.extend([
                    nn.Linear(prev_size, hidden_size),
                    nn.ReLU(),
                    nn.BatchNorm1d(hidden_size),
                    nn.Dropout(dropout)
                ])
                prev_size = hidden_size
            
            layers.append(nn.Linear(prev_size, output_size))
            self.network = nn.Sequential(*layers)
        
        def forward(self, x):
            return self.network(x)
    
    # Create model
    model = DeepNet(
        input_size=n_features,
        hidden_sizes=model_config['hidden_sizes'],
        output_size=n_classes,
        dropout=model_config.get('dropout', 0.2)
    ).to(device)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(
        model.parameters(), 
        lr=training_config['learning_rate'],
        weight_decay=training_config.get('weight_decay', 1e-4)
    )
    
    # Training loop
    model.train()
    training_start = time.time()
    
    epoch_losses = []
    epoch_accuracies = []
    
    for epoch in range(training_config['epochs']):
        epoch_loss = 0.0
        correct = 0
        total = 0
        
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            predicted = output.argmax(dim=1)  # class index with the highest logit
            total += target.size(0)
            correct += (predicted == target).sum().item()
        
        avg_loss = epoch_loss / len(dataloader)
        accuracy = 100.0 * correct / total
        
        epoch_losses.append(avg_loss)
        epoch_accuracies.append(accuracy)
        
        if epoch % 10 == 0 or epoch == training_config['epochs'] - 1:
            print(f'Epoch {epoch+1}/{training_config["epochs"]}: '
                  f'Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
    
    training_time = time.time() - training_start
    
    # Model evaluation
    model.eval()
    with torch.no_grad():
        test_data = torch.randn(1000, n_features).to(device)
        test_output = model(test_data)
        test_predictions = torch.max(test_output, 1)[1]
    
    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }
    
    return {
        'training_completed': True,
        'device_used': str(device),
        'model_parameters': sum(p.numel() for p in model.parameters()),
        'trainable_parameters': sum(p.numel() for p in model.parameters() if p.requires_grad),
        'training_time': training_time,
        'final_loss': epoch_losses[-1],
        'final_accuracy': epoch_accuracies[-1],
        'best_accuracy': max(epoch_accuracies),
        'epoch_losses': epoch_losses,
        'epoch_accuracies': epoch_accuracies,
        'memory_info': memory_info,
        'model_architecture': str(model)
    }

# Example configuration
model_config = {
    'hidden_sizes': [512, 256, 128, 64],
    'dropout': 0.3
}

training_config = {
    'n_samples': 10000,
    'n_features': 100,
    'n_classes': 10,
    'batch_size': 64,
    'epochs': 50,
    'learning_rate': 0.001,
    'weight_decay': 1e-4
}

# Run training
# result = lambda_deep_learning_training(model_config, training_config)
# print(f"Training completed! Final accuracy: {result['final_accuracy']:.2f}%")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")

print("Deep learning training function defined. Uncomment the lines above to run on Lambda Cloud.")
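
The `model_parameters` value in the result can be checked by hand, which is a useful sanity test on the architecture. The sketch below counts the parameters implied by the `DeepNet` layout above: each hidden block contributes a `Linear` layer (weights plus biases) and a `BatchNorm1d` layer (a scale and a shift per feature), while `ReLU` and `Dropout` have none. `count_deepnet_parameters` is a hypothetical helper and needs no GPU.

```python
def count_deepnet_parameters(input_size, hidden_sizes, output_size):
    """Count trainable parameters for the Linear/ReLU/BatchNorm1d/Dropout stack."""
    total = 0
    prev = input_size
    for hidden in hidden_sizes:
        total += prev * hidden + hidden  # nn.Linear: weights + biases
        total += 2 * hidden              # nn.BatchNorm1d: scale + shift
        prev = hidden
    total += prev * output_size + output_size  # final nn.Linear
    return total


# Matches the example configuration: 100 features, 4 hidden layers, 10 classes
print(count_deepnet_parameters(100, [512, 256, 128, 64], 10))  # 226762
```

At roughly 227 thousand parameters, this model is far too small to stress a data-center GPU, so the example mainly exercises the remote execution path.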

## Example 2: Transformer Model Fine-tuning

In [None]:
@cluster(cores=8, memory="32GB", time="02:00:00")
def lambda_transformer_finetuning(model_name, training_params):
    """Fine-tune a transformer model on Lambda Cloud GPU."""
    import torch
    from transformers import (
        AutoTokenizer, AutoModelForSequenceClassification,
        TrainingArguments, Trainer, DataCollatorWithPadding
    )
    from datasets import Dataset
    import numpy as np
    import time
    
    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Fine-tuning on device: {device}")
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=training_params['num_labels']
    )
    
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # Create synthetic dataset
    def generate_synthetic_text_data(n_samples, num_labels):
        """Generate synthetic text classification data."""
        
        # Simple text templates for different classes
        templates = {
            0: ["This is a positive example about {}", "Great work on {}", "Excellent {}"],
            1: ["This is a negative example about {}", "Poor {}", "Terrible {}"],
            2: ["This is a neutral example about {}", "Average {}", "Okay {}"] if num_labels > 2 else []
        }
        
        topics = ["technology", "sports", "food", "movies", "music", "books", "travel", "science"]
        
        texts = []
        labels = []
        
        for _ in range(n_samples):
            label = np.random.randint(0, num_labels)
            template = np.random.choice(templates[label])
            topic = np.random.choice(topics)
            text = template.format(topic)
            
            texts.append(text)
            labels.append(label)
        
        return texts, labels
    
    # Generate data
    train_texts, train_labels = generate_synthetic_text_data(
        training_params['train_samples'], training_params['num_labels']
    )
    eval_texts, eval_labels = generate_synthetic_text_data(
        training_params['eval_samples'], training_params['num_labels']
    )
    
    # Tokenize data
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            truncation=True,
            padding=True,
            max_length=training_params.get('max_length', 512)
        )
    
    # Create datasets
    train_dataset = Dataset.from_dict({'text': train_texts, 'labels': train_labels})
    eval_dataset = Dataset.from_dict({'text': eval_texts, 'labels': eval_labels})
    
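    # Drop the raw 'text' column: with remove_unused_columns=False the collator
    # would otherwise try (and fail) to turn the strings into tensors
    train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=['text'])
    eval_dataset = eval_dataset.map(tokenize_function, batched=True, remove_columns=['text'])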
    
    # Data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir='/tmp/results',
        num_train_epochs=training_params.get('epochs', 3),
        per_device_train_batch_size=training_params.get('batch_size', 8),
        per_device_eval_batch_size=training_params.get('eval_batch_size', 8),
        warmup_steps=training_params.get('warmup_steps', 100),
        weight_decay=training_params.get('weight_decay', 0.01),
        learning_rate=training_params.get('learning_rate', 2e-5),
        logging_dir='/tmp/logs',
        logging_steps=10,
        eval_strategy="epoch",  # named `evaluation_strategy` in transformers < 4.41
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
        dataloader_pin_memory=torch.cuda.is_available(),
        remove_unused_columns=False
    )
    
    # Define compute metrics
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        accuracy = (predictions == labels).mean()
        return {'accuracy': accuracy}
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )
    
    # Training
    start_time = time.time()
    train_result = trainer.train()
    training_time = time.time() - start_time
    
    # Final evaluation
    eval_result = trainer.evaluate()
    
    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }
    
    # Model info
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    return {
        'model_name': model_name,
        'device_used': str(device),
        'training_completed': True,
        'training_time': training_time,
        'total_parameters': total_params,
        'trainable_parameters': trainable_params,
        'train_loss': train_result.training_loss,
        'eval_loss': eval_result['eval_loss'],
        'eval_accuracy': eval_result['eval_accuracy'],
        'train_steps': train_result.global_step,
        'memory_info': memory_info,
        'training_params': training_params
    }

# Example configuration
training_params = {
    'num_labels': 3,
    'train_samples': 1000,
    'eval_samples': 200,
    'epochs': 3,
    'batch_size': 16,
    'eval_batch_size': 32,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_steps': 100,
    'max_length': 256
}

# Run fine-tuning
# result = lambda_transformer_finetuning('distilbert-base-uncased', training_params)
# print(f"Fine-tuning completed! Final accuracy: {result['eval_accuracy']:.4f}")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"Model parameters: {result['total_parameters']:,}")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")

print("Transformer fine-tuning function defined. Uncomment the lines above to run on Lambda Cloud.")
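
It is worth comparing `warmup_steps` with the total number of optimizer steps before launching a run. The sketch below approximates that total for a single-GPU run without gradient accumulation. `total_optimizer_steps` is a hypothetical helper that mirrors the usual batches-per-epoch arithmetic, not a call into the `transformers` library.

```python
import math


def total_optimizer_steps(n_samples, batch_size, epochs, n_devices=1, grad_accum=1):
    """Approximate the number of optimizer updates in a training run."""
    batches_per_epoch = math.ceil(n_samples / (batch_size * n_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


# Values from the example configuration above
steps = total_optimizer_steps(n_samples=1000, batch_size=16, epochs=3)
warmup_fraction = 100 / steps
print(f"{steps} steps, warmup covers {warmup_fraction:.0%} of training")
```

With the example configuration, the 100 warmup steps cover roughly half of the 189 total steps, so a smaller `warmup_steps` value may be more appropriate for a run this short.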

## Example 3: Computer Vision with Large Datasets

In [None]:
@cluster(cores=8, memory="32GB", time="01:30:00")
def lambda_computer_vision_training(model_config, data_config):
    """Train a computer vision model on Lambda Cloud GPU."""
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader, TensorDataset
    import numpy as np
    import time
    
    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training computer vision model on device: {device}")
    
    # Data augmentation and preprocessing
    transform_train = transforms.Compose([
        transforms.ToPILImage(),
        transforms.RandomResizedCrop(data_config['image_size']),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=15),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    
    transform_val = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((data_config['image_size'], data_config['image_size'])),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    
    # Generate synthetic image data
    def create_synthetic_images(n_samples, image_size, n_channels, n_classes):
        """Create synthetic image dataset."""
        images = np.random.randint(0, 256, (n_samples, image_size, image_size, n_channels), dtype=np.uint8)
        labels = np.random.randint(0, n_classes, n_samples)
        return images, labels
    
    # Create datasets
    train_images, train_labels = create_synthetic_images(
        data_config['train_samples'],
        data_config['image_size'],
        data_config['n_channels'],
        data_config['n_classes']
    )
    
    val_images, val_labels = create_synthetic_images(
        data_config['val_samples'],
        data_config['image_size'],
        data_config['n_channels'],
        data_config['n_classes']
    )
    
    # Custom dataset class
    class SyntheticImageDataset(torch.utils.data.Dataset):
        def __init__(self, images, labels, transform=None):
            self.images = images
            self.labels = labels
            self.transform = transform
        
        def __len__(self):
            return len(self.images)
        
        def __getitem__(self, idx):
            image = self.images[idx]
            label = self.labels[idx]
            
            if self.transform:
                image = self.transform(image)
            else:
                image = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
            
            return image, label
    
    # Create data loaders
    train_dataset = SyntheticImageDataset(train_images, train_labels, transform_train)
    val_dataset = SyntheticImageDataset(val_images, val_labels, transform_val)
    
    train_loader = DataLoader(
        train_dataset,
        batch_size=data_config['batch_size'],
        shuffle=True,
        num_workers=4,
        pin_memory=torch.cuda.is_available()
    )
    
    val_loader = DataLoader(
        val_dataset,
        batch_size=data_config['batch_size'],
        shuffle=False,
        num_workers=4,
        pin_memory=torch.cuda.is_available()
    )
    
    # Model definition
    # torchvision >= 0.13 selects weights via `weights=`; `pretrained=` is deprecated
    if model_config['model_type'] == 'resnet':
        if model_config['pretrained']:
            model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
            model.fc = nn.Linear(model.fc.in_features, data_config['n_classes'])
        else:
            model = torchvision.models.resnet50(weights=None, num_classes=data_config['n_classes'])
    elif model_config['model_type'] == 'efficientnet':
        if model_config['pretrained']:
            model = torchvision.models.efficientnet_b0(weights=torchvision.models.EfficientNet_B0_Weights.DEFAULT)
            model.classifier[1] = nn.Linear(model.classifier[1].in_features, data_config['n_classes'])
        else:
            model = torchvision.models.efficientnet_b0(weights=None, num_classes=data_config['n_classes'])
    else:
        raise ValueError(f"Unsupported model type: {model_config['model_type']}")
    
    model = model.to(device)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(
        model.parameters(),
        lr=model_config['learning_rate'],
        weight_decay=model_config['weight_decay']
    )
    
    # Learning rate scheduler
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=model_config['epochs']
    )
    
    # Training loop
    start_time = time.time()
    train_losses = []
    val_accuracies = []
    
    for epoch in range(model_config['epochs']):
        # Training phase
        model.train()
        running_loss = 0.0
        
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
        
        avg_train_loss = running_loss / len(train_loader)
        train_losses.append(avg_train_loss)
        
        # Validation phase
        model.eval()
        correct = 0
        total = 0
        val_loss = 0.0
        
        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                val_loss += criterion(output, target).item()
                
                predicted = output.argmax(dim=1)  # class index with the highest logit
                total += target.size(0)
                correct += (predicted == target).sum().item()
        
        val_accuracy = 100.0 * correct / total
        val_accuracies.append(val_accuracy)
        
        scheduler.step()
        
        if epoch % 5 == 0 or epoch == model_config['epochs'] - 1:
            print(f'Epoch {epoch+1}/{model_config["epochs"]}: '
                  f'Train Loss: {avg_train_loss:.4f}, '
                  f'Val Accuracy: {val_accuracy:.2f}%, '
                  f'LR: {scheduler.get_last_lr()[0]:.6f}')
    
    training_time = time.time() - start_time
    
    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }
    
    return {
        'training_completed': True,
        'device_used': str(device),
        'model_type': model_config['model_type'],
        'model_parameters': sum(p.numel() for p in model.parameters()),
        'training_time': training_time,
        'final_train_loss': train_losses[-1],
        'final_val_accuracy': val_accuracies[-1],
        'best_val_accuracy': max(val_accuracies),
        'train_losses': train_losses,
        'val_accuracies': val_accuracies,
        'memory_info': memory_info,
        'data_config': data_config,
        'model_config': model_config
    }

# Example configuration
model_config = {
    'model_type': 'resnet',  # or 'efficientnet'
    'pretrained': True,
    'epochs': 20,
    'learning_rate': 0.001,
    'weight_decay': 1e-4
}

data_config = {
    'train_samples': 5000,
    'val_samples': 1000,
    'image_size': 224,
    'n_channels': 3,
    'n_classes': 10,
    'batch_size': 32
}

# Run training
# result = lambda_computer_vision_training(model_config, data_config)
# print(f"CV training completed! Best accuracy: {result['best_val_accuracy']:.2f}%")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"Model parameters: {result['model_parameters']:,}")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")

print("Computer vision training function defined. Uncomment the lines above to run on Lambda Cloud.")
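
The synthetic images are held in host memory as `uint8` arrays, so their footprint follows directly from `data_config`. The sketch below estimates it, which helps when choosing the `memory` argument of `@cluster` or scaling up the sample counts. `synthetic_images_bytes` is a hypothetical helper, and the estimate ignores the extra memory used by data-loader workers and by the model itself.

```python
def synthetic_images_bytes(n_samples, image_size, n_channels, bytes_per_value=1):
    """Bytes needed for an (n_samples, H, W, C) array; uint8 uses 1 byte per value."""
    return n_samples * image_size * image_size * n_channels * bytes_per_value


# Values from the example data_config above
train_bytes = synthetic_images_bytes(5000, 224, 3)
val_bytes = synthetic_images_bytes(1000, 224, 3)
print(f"train: {train_bytes / 1024**2:.1f} MiB, val: {val_bytes / 1024**2:.1f} MiB")
```

Storing the same images as `float32` would take four times as much memory (`bytes_per_value=4`), which is one reason the example keeps them as `uint8` until the transforms run.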

## Multi-GPU Training on Lambda Cloud

In [None]:
def setup_multi_gpu_training():
    """
    Guide for multi-GPU training on Lambda Cloud.
    """
    
    multi_gpu_guide = """
Multi-GPU Training on Lambda Cloud:

**Available Multi-GPU Instances:**
- 2x A100 (40GB): ~$2.20/hour
- 4x A100 (40GB): ~$4.40/hour
- 8x A100 (40GB): ~$8.80/hour
- 2x A100 (80GB): ~$2.80/hour
- 4x A100 (80GB): ~$5.60/hour
- 8x A100 (80GB): ~$11.20/hour
- 8x H100: ~$20.00/hour (when available)

**Setup Requirements:**
1. Launch multi-GPU instance via Lambda Cloud console
2. Install additional packages for distributed training:
   pip install accelerate deepspeed
3. Configure Clustrix for multi-GPU environment
4. Use appropriate parallelization strategy

**Parallelization Strategies:**
- Data Parallel (DP): Simple, works for most models
- Distributed Data Parallel (DDP): Better performance, recommended
- Model Parallel: For very large models that don't fit on a single GPU
- Pipeline Parallel: For extremely large models
- DeepSpeed ZeRO: For memory-efficient training of large models
"""
    
    multi_gpu_code = '''
@cluster(cores=16, memory="128GB", time="04:00:00")
def lambda_multi_gpu_training(model_config, training_config):
    """Multi-GPU training example using PyTorch DDP."""
    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed import init_process_group, destroy_process_group
    import os
    
    def setup_ddp(rank, world_size):
        """Setup distributed data parallel."""
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
    
    def cleanup_ddp():
        """Clean up distributed training."""
        destroy_process_group()
    
    def train_on_gpu(rank, world_size, model_config, training_config):
        """Training function for each GPU."""
        setup_ddp(rank, world_size)
        
        # Create model and move to GPU
        # (create_model and create_distributed_dataloader are placeholders
        # for your own code; they are not provided by Clustrix or PyTorch)
        model = create_model(model_config).to(rank)
        model = DDP(model, device_ids=[rank])
        
        # Create data loader with DistributedSampler
        train_loader = create_distributed_dataloader(training_config, rank, world_size)
        
        # Training loop
        optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])
        criterion = nn.CrossEntropyLoss()  # Build the loss once, not per batch
        
        for epoch in range(training_config['epochs']):
            train_loader.sampler.set_epoch(epoch)  # Reshuffle differently each epoch
            
            for batch_idx, (data, target) in enumerate(train_loader):
                data, target = data.to(rank), target.to(rank)
                
                optimizer.zero_grad()
                output = model(data)
                loss = criterion(output, target)
                loss.backward()
                optimizer.step()
                
                if rank == 0 and batch_idx % 100 == 0:
                    print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')
        
        cleanup_ddp()
    
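    # Launch multi-GPU training
    # Note: mp.spawn pickles its target function, so in a real script
    # train_on_gpu must be defined at module level; it is nested here
    # only to keep the example in one place.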
    world_size = torch.cuda.device_count()
    print(f"Starting multi-GPU training on {world_size} GPUs")
    
    mp.spawn(
        train_on_gpu,
        args=(world_size, model_config, training_config),
        nprocs=world_size,
        join=True
    )
    
    return {"training_completed": True, "gpus_used": world_size}
'''
    
    accelerate_config = '''
# Alternative: Using HuggingFace Accelerate for easier multi-GPU setup

@cluster(cores=16, memory="128GB", time="04:00:00")
def lambda_accelerate_training(model_config, training_config):
    """Multi-GPU training using HuggingFace Accelerate."""
    from accelerate import Accelerator
    import torch
    import torch.nn as nn
    
    # Initialize accelerator
    accelerator = Accelerator()
    device = accelerator.device
    
    # Create model and optimizer
    # (create_model and create_dataloader are placeholders for your own code)
    model = create_model(model_config)
    optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])
    criterion = nn.CrossEntropyLoss()  # Build the loss once, not per batch
    train_loader = create_dataloader(training_config)
    
    # Prepare for distributed training
    model, optimizer, train_loader = accelerator.prepare(
        model, optimizer, train_loader
    )
    
    # Training loop
    model.train()
    for epoch in range(training_config['epochs']):
        for batch_idx, (data, target) in enumerate(train_loader):
            with accelerator.accumulate(model):
                output = model(data)
                loss = criterion(output, target)
                
                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()
            
            if accelerator.is_main_process and batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')
    
    return {
        "training_completed": True,
        "num_processes": accelerator.num_processes,
        "device": str(device)
    }
'''
    
    print("Multi-GPU Training Guide:")
    print("========================")
    print(multi_gpu_guide)
    print("\nPyTorch DDP Example:")
    print(multi_gpu_code)
    print("\nHuggingFace Accelerate Example:")
    print(accelerate_config)
    
    return {
        'guide': multi_gpu_guide,
        'ddp_code': multi_gpu_code,
        'accelerate_code': accelerate_config
    }

multi_gpu_info = setup_multi_gpu_training()
print("\nMulti-GPU training guide created. Lambda Cloud excels at multi-GPU workloads!")
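
The DDP example depends on a `DistributedSampler` to hand each GPU a distinct slice of the dataset, and on `set_epoch` to change the shuffle between epochs. The pure-Python sketch below illustrates that idea only. It is not PyTorch's implementation, and `shard_indices` is a hypothetical helper name chosen for this illustration.

```python
import random


def shard_indices(dataset_size, world_size, rank, epoch=0, seed=0, shuffle=True):
    """Return the dataset indices one rank would process in one epoch."""
    if dataset_size < 1 or world_size < 1:
        raise ValueError("dataset_size and world_size must be positive")
    indices = list(range(dataset_size))
    if shuffle:
        # All ranks use the same seed + epoch, so they agree on the order.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating indices so every rank gets the same number of samples.
    per_rank = -(-dataset_size // world_size)  # ceiling division
    total = per_rank * world_size
    while len(indices) < total:
        indices += indices[: total - len(indices)]
    # Strided split: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


shards = [shard_indices(10, world_size=4, rank=r, shuffle=False) for r in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
```

Because the shuffle seed depends on the epoch, skipping `set_epoch` would leave every epoch with the same ordering on every GPU.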

## Cost Optimization Strategies

In [None]:
def lambda_cloud_cost_optimization():
    """
    Cost optimization strategies for Lambda Cloud + Clustrix.
    """
    
    cost_monitoring_code = '''
@cluster(cores=2, memory="4GB")
def monitor_lambda_costs():
    """Monitor Lambda Cloud usage and costs."""
    import requests
    import time
    import subprocess
    
    # GPU utilization monitoring
    def get_gpu_utilization():
        try:
            result = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total', 
                                   '--format=csv,noheader,nounits'], 
                                  capture_output=True, text=True)
            if result.returncode == 0:
                lines = result.stdout.strip().split('\\n')
                gpu_stats = []
                for i, line in enumerate(lines):
                    parts = line.split(', ')
                    gpu_stats.append({
                        'gpu_id': i,
                        'utilization_percent': int(parts[0]),
                        'memory_used_mb': int(parts[1]),
                        'memory_total_mb': int(parts[2]),
                        'memory_utilization_percent': round(int(parts[1]) / int(parts[2]) * 100, 1)
                    })
                return gpu_stats
        except Exception as e:
            return {"error": str(e)}
        return []
    
    # Monitoring report
    gpu_stats = get_gpu_utilization()
    current_time = time.time()
    
    return {
        'timestamp': current_time,
        'gpu_utilization': gpu_stats,
        'monitoring_tips': [
            'Keep GPU utilization > 80% for cost efficiency',
            'Monitor memory usage to choose right instance type',
            'Use mixed precision to reduce memory requirements',
            'Terminate instances immediately after training'
        ]
    }

# Cost estimation (defined at module level so track_costs can call it)
import time

def estimate_costs(instance_type, hours_used):
    # Lambda Cloud pricing (approximate, USD per hour)
    pricing = {
        'rtx6000ada': 0.75,
        'a10': 0.60,
        'a100_40gb': 1.10,
        'a100_80gb': 1.40,
        'h100': 2.50,
        '2xa100_40gb': 2.20,
        '4xa100_40gb': 4.40,
        '8xa100_40gb': 8.80
    }
    
    # Unknown instance types fall back to $1.00/hour
    hourly_rate = pricing.get(instance_type, 1.0)
    return {
        'instance_type': instance_type,
        'hourly_rate': hourly_rate,
        'hours_used': hours_used,
        'estimated_cost': hourly_rate * hours_used
    }

# Cost tracking decorator
def track_costs(instance_type='a100_40gb'):
    """Decorator to track costs of Clustrix jobs."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = func(*args, **kwargs)
                success = True
            except Exception as e:
                result = {"error": str(e)}
                success = False
            
            end_time = time.time()
            duration_hours = (end_time - start_time) / 3600
            
            cost_info = estimate_costs(instance_type, duration_hours)
            
            return {
                'result': result,
                'success': success,
                'duration_seconds': end_time - start_time,
                'cost_info': cost_info
            }
        return wrapper
    return decorator

# Example usage:
# @track_costs('a100_40gb')
# @cluster(cores=8, memory="32GB")
# def my_expensive_training():
#     # Your training code here
#     pass
'''
    
    print("Cost Monitoring Code:")
    print(cost_monitoring_code)
    
    return {
        'monitoring_code': cost_monitoring_code
    }

cost_info = lambda_cloud_cost_optimization()
print("\nCost optimization guide created. Lambda Cloud offers excellent GPU price/performance!")
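
The monitoring code above splits `nvidia-smi` output on `', '` and converts every field with `int()`, which raises if a GPU reports a non-numeric value such as `[N/A]`. A more tolerant parser could look like the sketch below. `parse_gpu_stats` is a hypothetical helper, and the sample text is made up for illustration rather than captured from a real instance.

```python
def parse_gpu_stats(csv_text):
    """Parse output of
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total
               --format=csv,noheader,nounits
    into a list of per-GPU dicts, skipping rows that cannot be parsed."""
    stats = []
    for gpu_id, line in enumerate(csv_text.strip().splitlines()):
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # not a GPU row
        try:
            utilization, used_mb, total_mb = (int(field) for field in fields)
        except ValueError:
            continue  # non-numeric value reported for this GPU
        stats.append({
            "gpu_id": gpu_id,
            "utilization_percent": utilization,
            "memory_used_mb": used_mb,
            "memory_total_mb": total_mb,
            "memory_utilization_percent": (
                round(used_mb / total_mb * 100, 1) if total_mb else 0.0
            ),
        })
    return stats


sample_output = "87, 30520, 40960\n12, 1024, 40960\n"  # illustrative values
parsed = parse_gpu_stats(sample_output)
print(parsed[0]["memory_utilization_percent"])  # 74.5
```

On a real instance the function would be fed `result.stdout` from the `subprocess.run` call shown in the monitoring code.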

### Lambda Cloud Cost Optimization

#### 💰 Instance Selection
- **RTX 6000 Ada**: Best value for most ML workloads (~$0.75/hour)
- **A10**: Good balance of performance and cost (~$0.60/hour)
- **A100 40GB**: For large models requiring more VRAM (~$1.10/hour)
- **A100 80GB**: Only when 40GB is insufficient (~$1.40/hour)
- **H100**: Premium option for cutting-edge research (~$2.50/hour)
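
One way to apply this list is to pick the cheapest instance whose GPU memory covers the model's needs. The sketch below uses the approximate prices quoted above together with each GPU's published memory size. `INSTANCES` and `cheapest_instance` are illustrative names, and the selection ignores throughput, so it is a starting point rather than a recommendation.

```python
# Approximate USD/hour from the list above; VRAM is each GPU's published capacity.
INSTANCES = {
    "a10": {"vram_gb": 24, "usd_per_hour": 0.60},
    "rtx6000ada": {"vram_gb": 48, "usd_per_hour": 0.75},
    "a100_40gb": {"vram_gb": 40, "usd_per_hour": 1.10},
    "a100_80gb": {"vram_gb": 80, "usd_per_hour": 1.40},
    "h100": {"vram_gb": 80, "usd_per_hour": 2.50},
}


def cheapest_instance(required_vram_gb):
    """Return the lowest-priced instance with enough VRAM, or None."""
    candidates = [
        (spec["usd_per_hour"], name)
        for name, spec in INSTANCES.items()
        if spec["vram_gb"] >= required_vram_gb
    ]
    if not candidates:
        return None  # nothing fits on one GPU; consider multi-GPU training
    return min(candidates)[1]


for need in (16, 30, 60):
    print(need, "GB ->", cheapest_instance(need))
```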

#### ⏰ Usage Patterns
- Use "persistent" instances for ongoing development
- Terminate instances immediately after training completion
- Schedule training jobs during off-peak hours if possible
- Use local development for debugging, GPU for final training

#### 🔧 Optimization Techniques
- Mixed precision training (fp16) to reduce memory usage
- Gradient accumulation for effective larger batch sizes
- Model checkpointing to resume interrupted training
- Efficient data loading with multiple workers
- Early stopping to avoid overtraining
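
Gradient accumulation trades extra steps for memory: the effective batch size is the per-GPU batch size times the number of GPUs times the number of accumulated micro-batches. The following sketch computes how many micro-batches are needed to reach a target. `accumulation_steps` is a hypothetical helper name.

```python
import math


def accumulation_steps(target_batch_size, per_gpu_batch_size, num_gpus=1):
    """Micro-batches to accumulate before each optimizer step."""
    if min(target_batch_size, per_gpu_batch_size, num_gpus) < 1:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_batch_size / (per_gpu_batch_size * num_gpus))


steps = accumulation_steps(target_batch_size=256, per_gpu_batch_size=32)
print(steps)            # 8
print(32 * 1 * steps)   # effective batch size: 256
```

With HuggingFace Accelerate, a value like this would typically be passed as `Accelerator(gradient_accumulation_steps=steps)` so that `accelerator.accumulate(model)` only steps the optimizer on every `steps`-th batch.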

#### 📊 Monitoring and Management
- Monitor GPU utilization with nvidia-smi
- Track training progress with logging
- Set training time limits to prevent runaway costs
- Use Clustrix timeouts as safety nets
- Regular cost reviews and budget alerts
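
A training time limit can be derived directly from a budget: divide the budget by the hourly rate and express the result in the `HH:MM:SS` form used by the `time=` argument in this tutorial's `@cluster` examples. The helper below is a sketch, with `time_limit_for_budget` as a hypothetical name.

```python
def time_limit_for_budget(budget_usd, hourly_rate_usd):
    """Return the longest run the budget covers as an "HH:MM:SS" string."""
    if budget_usd <= 0 or hourly_rate_usd <= 0:
        raise ValueError("budget and hourly rate must be positive")
    total_seconds = int(budget_usd / hourly_rate_usd * 3600)  # round down
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# $10 on an A100 40GB at roughly $1.10/hour
print(time_limit_for_budget(10, 1.10))  # 09:05:27

# Possible use with Clustrix (same keyword arguments as earlier examples):
# @cluster(cores=8, memory="32GB", time=time_limit_for_budget(10, 1.10))
# def my_training(): ...
```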

#### 🚀 Clustrix-Specific Optimizations
- Use Clustrix auto-cleanup features
- Implement job queuing for multiple experiments
- Leverage Clustrix's timeout mechanisms
- Use remote environment caching

## Best Practices and Troubleshooting

In [None]:
def lambda_cloud_best_practices():
    """
    Best practices for Lambda Cloud + Clustrix integration.
    """
    
    monitoring_script = '''
#!/bin/bash
# Lambda Cloud monitoring script
# Save as monitor_training.sh and run with: bash monitor_training.sh

echo "Lambda Cloud Training Monitor"
echo "============================"
echo "Start time: $(date)"
echo ""

# System information
echo "System Information:"
echo "------------------"
nvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv
echo ""

# Monitor GPU usage every 30 seconds
while true; do
    echo "GPU Status at $(date):"
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
    echo ""
    
    # Check if training process is still running
    if ! pgrep -f python > /dev/null; then
        echo "No Python processes found. Training may have completed."
        break
    fi
    
    sleep 30
done

echo "Monitoring completed at $(date)"
'''
    
    print("Monitoring Script:")
    print("==================")
    print(monitoring_script)
    
    return {
        'monitoring_script': monitoring_script
    }

practices_info = lambda_cloud_best_practices()
print("\nComprehensive best practices guide created for Lambda Cloud.")

### Lambda Cloud + Clustrix Best Practices

#### 🚀 Performance Optimization
- Always use mixed precision (fp16) when possible
- Optimize data loading with multiple workers and pin_memory
- Use appropriate batch sizes to maximize GPU utilization
- Enable tensor cores for compatible operations
- Pre-allocate GPU memory to avoid fragmentation

#### 💾 Data Management
- Store datasets on fast NVMe storage when available
- Use data streaming for very large datasets
- Implement efficient data preprocessing pipelines
- Cache frequently used data in memory
- Use appropriate data formats (e.g., HDF5, Parquet)

#### 🔧 Environment Setup
- Use conda environments for reproducible setups
- Pin package versions in requirements.txt
- Install packages from conda-forge when possible
- Use uv package manager for faster installs
- Set up proper CUDA environment variables

#### 🛠️ Development Workflow
- Develop and debug locally, train on Lambda Cloud
- Use small datasets for initial testing
- Implement proper logging and monitoring
- Save model checkpoints regularly
- Use version control for experiment tracking

#### 🔒 Security
- Use SSH keys instead of passwords
- Keep SSH keys secure and rotate regularly
- Don't store credentials in code or notebooks
- Use environment variables for configuration
- Monitor instance access logs
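
To keep hosts and key paths out of notebooks, connection settings can be read from environment variables and passed to `configure`. The sketch below assumes variable names `LAMBDA_HOST`, `LAMBDA_SSH_KEY`, and `LAMBDA_USER`. These names are invented for this example and are not defined by Clustrix or Lambda Cloud.

```python
import os


def load_lambda_settings(environ=None):
    """Build Clustrix SSH settings from environment variables."""
    environ = os.environ if environ is None else environ
    required = ("LAMBDA_HOST", "LAMBDA_SSH_KEY")  # hypothetical variable names
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "cluster_type": "ssh",
        "cluster_host": environ["LAMBDA_HOST"],
        "username": environ.get("LAMBDA_USER", "ubuntu"),
        "key_file": os.path.expanduser(environ["LAMBDA_SSH_KEY"]),
    }


# Example with an explicit mapping instead of the real environment
settings = load_lambda_settings({
    "LAMBDA_HOST": "203.0.113.10",  # documentation-only IP address
    "LAMBDA_SSH_KEY": "/home/me/.ssh/id_ed25519",
})
print(settings["username"])  # ubuntu

# In a notebook: configure(**load_lambda_settings())
```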

### Common Issues and Solutions

#### ❌ CUDA out of memory errors
✅ **Solutions:**
- Reduce batch size
- Enable gradient checkpointing
- Use mixed precision training
- Clear GPU cache with torch.cuda.empty_cache()
- Consider model parallelism for large models
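
The first remedy, reducing the batch size, can be automated by halving until one training step succeeds. The sketch below is framework-agnostic and uses the built-in `MemoryError` as a stand-in. With PyTorch you would pass `oom_errors=(torch.cuda.OutOfMemoryError,)`, available in recent PyTorch releases, and call `torch.cuda.empty_cache()` between attempts. `find_working_batch_size` and `fake_training_step` are hypothetical names.

```python
def find_working_batch_size(run_step, start_batch_size, min_batch_size=1,
                            oom_errors=(MemoryError,)):
    """Halve the batch size until run_step(batch_size) stops running out of memory."""
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except oom_errors:
            batch_size //= 2  # try again with half the batch
    raise RuntimeError("No batch size fits in memory; consider a smaller model "
                       "or gradient checkpointing")


def fake_training_step(batch_size):
    """Stand-in for one forward/backward pass; pretends 48 samples is the limit."""
    if batch_size > 48:
        raise MemoryError("simulated CUDA out of memory")


print(find_working_batch_size(fake_training_step, 256))  # 32
```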

#### ❌ Slow data loading
✅ **Solutions:**
- Increase num_workers in DataLoader
- Enable pin_memory for GPU transfers
- Use faster storage (NVMe over network storage)
- Implement data prefetching
- Optimize data preprocessing
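
Prefetching means loading the next batches while the current one is being processed. PyTorch's `DataLoader` already does this when `num_workers > 0`, so the standard-library sketch below is only meant to show the mechanism for custom pipelines. `prefetch` is a hypothetical helper, and for brevity it does not forward exceptions raised inside the loader thread.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread loads ahead."""
    sentinel = object()
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(sentinel)  # always signal the end of the stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


batches = list(prefetch(range(5)))
print(batches)  # [0, 1, 2, 3, 4]
```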

#### ❌ SSH connection timeouts
✅ **Solutions:**
- Configure SSH keep-alive settings
- Use screen or tmux for long-running jobs
- Implement proper error handling in Clustrix
- Set appropriate timeout values
- Monitor network connectivity
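
For transient connection drops, wrapping the remote call in a retry loop with exponential backoff is a common form of error handling. The sketch below assumes the failure surfaces as `ConnectionError` or `TimeoutError`. The exceptions Clustrix actually raises may differ, so adjust `retry_on` accordingly. `retry_with_backoff` and `flaky_submit` are hypothetical names.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call func(); on a retryable error wait 1x, 2x, 4x... base_delay and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_submit():
    """Stand-in for a remote call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated dropped SSH connection")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = retry_with_backoff(flaky_submit, attempts=5, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```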

#### ❌ Low GPU utilization
✅ **Solutions:**
- Increase batch size if memory allows
- Optimize data loading pipeline
- Use asynchronous data transfers
- Profile code to identify bottlenecks
- Consider multi-GPU training

#### ❌ Package installation failures
✅ **Solutions:**
- Use conda for system-level packages
- Check CUDA compatibility versions
- Clear pip cache if needed
- Use --no-cache-dir flag for pip
- Install packages in correct order

## Instance Management and Cleanup

### Lambda Cloud Instance Management

#### 🔍 Check Running Instances

**Via CLI:**
```bash
lambda-cloud instance list
```

**Via Web Console:**
Visit: https://cloud.lambdalabs.com/instances

#### ⏹️ Terminate Instances

**Terminate specific instance:**
```bash
lambda-cloud instance terminate <INSTANCE_ID>
```

**Terminate all instances (DANGEROUS!):**
```bash
lambda-cloud instance list --format=csv | grep -v "instance_id" | cut -d',' -f1 | xargs -I {} lambda-cloud instance terminate {}
```

#### 💾 Save Work Before Termination

**Save models to persistent storage:**
```bash
rsync -avz ubuntu@<INSTANCE_IP>:/path/to/models/ ./local_models/
```

**Save logs and results:**
```bash
scp -r ubuntu@<INSTANCE_IP>:/tmp/clustrix/ ./results/
```

#### 📊 Cost Monitoring

**Check current usage:**
```bash
lambda-cloud instance list --format=table
```

**List active instance types (the starting point for a cost estimate):**
```bash
lambda-cloud instance list --format=csv | awk -F',' 'NR>1 {print $2, $3}' | while read type status; do
    if [ "$status" = "active" ]; then
        echo "Active instance: $type"
    fi
done
```

### Automated Cleanup Script

Save this as `lambda_cleanup.sh` for automated instance management:

```bash
#!/bin/bash
# Automated cleanup script for Lambda Cloud
# Save as lambda_cleanup.sh

set -e

echo "Lambda Cloud Automated Cleanup"
echo "=============================="

# Check if lambda-cloud CLI is installed
if ! command -v lambda-cloud &> /dev/null; then
    echo "Error: lambda-cloud CLI not found. Please install it first."
    exit 1
fi

# List current instances
echo "Current instances:"
lambda-cloud instance list
echo ""

# Ask for confirmation
read -p "Do you want to terminate ALL instances? (y/N): " -n 1 -r
echo ""
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    echo "Cleanup cancelled."
    exit 0
fi

# Get instance IDs
INSTANCE_IDS=$(lambda-cloud instance list --format=csv | grep -v "instance_id" | cut -d',' -f1)

if [ -z "$INSTANCE_IDS" ]; then
    echo "No instances to terminate."
    exit 0
fi

# Terminate instances
echo "Terminating instances..."
for instance_id in $INSTANCE_IDS; do
    echo "Terminating instance: $instance_id"
    lambda-cloud instance terminate "$instance_id"
done

echo "All instances terminated."
echo "Please verify termination in the web console: https://cloud.lambdalabs.com/instances"
```

### Clustrix Integration Manager

In [None]:
# Integrate cleanup with Clustrix workflows

from clustrix import configure
import subprocess
import time

class LambdaCloudManager:
    """Manager for Lambda Cloud instances with Clustrix integration."""
    
    def __init__(self):
        self.active_instances = []
    
    def launch_instance_for_clustrix(self, instance_type, ssh_key_name):
        """Launch instance and configure Clustrix."""
        # Launch instance
        result = subprocess.run([
            'lambda-cloud', 'instance', 'launch',
            '--instance-type', instance_type,
            '--ssh-key-name', ssh_key_name
        ], capture_output=True, text=True)
        
        if result.returncode != 0:
            raise RuntimeError(f"Failed to launch instance: {result.stderr}")
        
        # Placeholders: parse the real instance ID and IP from result.stdout
        # before using this class; the strings below will not connect anywhere.
        instance_id = "extracted_from_output"
        instance_ip = "extracted_from_output"
        
        # Wait for instance to be ready
        time.sleep(60)  # Wait for startup
        
        # Configure Clustrix
        configure(
            cluster_type="ssh",
            cluster_host=instance_ip,
            username="ubuntu",
            key_file="~/.ssh/id_rsa",
            remote_work_dir="/tmp/clustrix",
            package_manager="auto",
            cleanup_on_success=True
        )
        
        self.active_instances.append({
            'id': instance_id,
            'ip': instance_ip,
            'type': instance_type,
            'launch_time': time.time()
        })
        
        return instance_id, instance_ip
    
    def cleanup_all_instances(self):
        """Clean up all managed instances."""
        for instance in self.active_instances:
            try:
                subprocess.run([
                    'lambda-cloud', 'instance', 'terminate', instance['id']
                ], check=True)
                print(f"Terminated instance {instance['id']}")
            except subprocess.CalledProcessError as e:
                print(f"Failed to terminate {instance['id']}: {e}")
        
        self.active_instances.clear()
    
    def __del__(self):
        """Best-effort cleanup; Python does not guarantee __del__ runs, so call cleanup_all_instances() explicitly."""
        if self.active_instances:
            print("Warning: Active instances detected. Cleaning up...")
            self.cleanup_all_instances()

# Usage example:
# manager = LambdaCloudManager()
# try:
#     instance_id, ip = manager.launch_instance_for_clustrix('a100', 'my-ssh-key')
#     # Run your Clustrix computations
#     result = my_clustrix_function()
# finally:
#     manager.cleanup_all_instances()
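
The fixed `time.sleep(60)` in `launch_instance_for_clustrix` may be too short or needlessly long. One alternative is to poll the instance's SSH port until it accepts connections. The sketch below uses only the standard library. `wait_for_port` is a hypothetical helper, and an open port 22 does not by itself prove that the instance has finished booting.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Return True once host:port accepts TCP connections, False after timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError on refusal or timeout
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Possible use inside the manager, in place of time.sleep(60):
# if not wait_for_port(instance_ip, 22, timeout=600):
#     raise RuntimeError("Instance did not become reachable over SSH")
```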

## Summary

This tutorial covered:

1. **Setup**: Lambda Cloud account creation and instance management
2. **GPU Computing**: High-performance GPU instances for ML workloads
3. **Deep Learning**: PyTorch training with GPU acceleration
4. **Transformer Models**: Fine-tuning with HuggingFace Transformers
5. **Computer Vision**: CNN training with data augmentation
6. **Multi-GPU Training**: Distributed training across multiple GPUs
7. **Cost Optimization**: Strategies to minimize GPU computing costs
8. **Best Practices**: Performance optimization and troubleshooting
9. **Instance Management**: Automated cleanup and monitoring

### Key Advantages of Lambda Cloud + Clustrix

- **GPU Focus**: Specialized in high-performance GPU computing
- **Cost Effective**: Competitive pricing for GPU instances
- **Simple Management**: Easy instance launching and termination
- **High Performance**: Latest NVIDIA GPUs (A100, H100, RTX)
- **Fast Networking**: InfiniBand for multi-GPU communication
- **ML Optimized**: Pre-configured environments for machine learning
- **Flexible Scaling**: From single GPU to large multi-GPU clusters

### Lambda Cloud Pricing Advantages

- **RTX 6000 Ada**: Excellent price/performance for most ML workloads
- **A100 40GB/80GB**: Industry-standard for large-scale training
- **H100**: Cutting-edge performance for the most demanding workloads
- **Multi-GPU**: Cost-effective scaling for distributed training
- **No Hidden Fees**: Simple per-hour pricing

### Next Steps

1. Create your Lambda Cloud account and add SSH keys
2. Start with a single GPU instance for testing
3. Configure Clustrix for your Lambda Cloud instance
4. Run the provided examples to verify setup
5. Scale to multi-GPU instances for larger workloads
6. Implement cost monitoring and automated cleanup

### Use Cases

- **Deep Learning Research**: Train large neural networks efficiently
- **Computer Vision**: Process large image datasets with CNNs
- **NLP**: Fine-tune transformer models on custom datasets
- **Scientific Computing**: GPU-accelerated simulations and modeling
- **Prototyping**: Rapid experimentation with different architectures
- **Production Training**: Scale up successful experiments

### Resources

- [Lambda Cloud Console](https://cloud.lambdalabs.com/)
- [Lambda Cloud Documentation](https://lambdalabs.com/service/gpu-cloud/documentation)
- [Lambda Cloud CLI](https://github.com/LambdaLabsML/lambda-cloud-cli)
- [PyTorch Documentation](https://pytorch.org/docs/)
- [HuggingFace Transformers](https://huggingface.co/transformers/)
- [Clustrix Documentation](https://clustrix.readthedocs.io/)

**Remember**: Lambda Cloud excels at GPU computing! Always terminate instances when not in use to control costs, and leverage Clustrix's distributed computing capabilities to scale your ML workloads efficiently.