# Modal: GPU-Accelerated AI Model Training

This course covers how to use Modal, a cloud platform for running machine learning workloads, with a focus on GPU-accelerated model training.

## What is Modal?

Modal is a cloud platform designed for running machine learning and data processing workloads. It allows you to:

- Run Python functions in the cloud with zero infrastructure management
- Access GPUs on-demand for deep learning
- Scale your applications automatically
- Deploy endpoints as API services
- Run scheduled jobs and batch processing

## Why Use Modal for AI Training?

- **Access to GPU Hardware**: Use powerful GPUs without purchasing expensive hardware
- **Pay-per-use Pricing**: Only pay for the compute you actually use
- **Zero Infrastructure Management**: No need to configure servers or manage containers
- **Easy Scaling**: Train on multiple GPUs with minimal code changes
- **Simplified Deployment**: Easily serve trained models as API endpoints

## Course Outline

1. Setting Up Modal
2. Understanding Modal Concepts
3. Creating a Custom Environment
4. Adding GPU Acceleration
5. Training a Deep Learning Model
6. Fine-tuning a Pretrained Model
7. Distributed Training with Modal
8. Deploying Trained Models as Endpoints
9. Best Practices for Modal GPU Training

## Prerequisites

- Python programming knowledge
- Basic understanding of machine learning concepts
- A Modal account (we'll cover how to set one up)

## Installation

In [None]:
# Install Modal and other required packages
!pip install modal torch torchvision transformers datasets matplotlib pandas numpy tqdm

## 1. Setting Up Modal

To use Modal, you need to:
1. Create a Modal account at https://modal.com/
2. Install the Modal CLI and Python client
3. Set up your authentication token

Let's go through these steps:

In [2]:
# First, install the Modal client if you haven't already
!pip install -q modal

# Import Modal
import modal

# Set up authentication (you need to run this once)
# This will open a browser window to authenticate
# !modal token new

# Verify the setup by running a trivial function remotely.
# Image objects have no .run() method, so the check goes through an App.
check_app = modal.App("auth-check")

@check_app.function()
def ping():
    return "Hello from Modal!"

try:
    with check_app.run():
        print(ping.remote())
    print("Authentication successful! Your Modal setup is working correctly.")
except Exception as e:
    print(f"Modal check failed: {e}")
    print("If this is an authentication error, run '!modal token new' to set up your token.")




## 2. Understanding Modal Concepts

Before we dive into GPU-accelerated training, let's understand the key concepts in Modal:

- **Functions**: Python functions that run in the cloud
- **Images**: Container images that define the environment (system packages and Python dependencies) your functions run in
- **Volumes**: Persistent storage for your functions
- **Apps**: Groups of functions that can be deployed together
- **Secrets**: Secure way to store API keys and other credentials

Let's create a simple function to demonstrate how Modal works:

In [4]:
# Define a basic Modal function

import modal

app = modal.App("basic-demo")

@app.function()
def hello_world(name):
    return f"Hello, {name} from Modal!"

# Run the function if this file is executed
if __name__ == "__main__":
    with app.run():
        result = hello_world.remote("Modal User")
        print(result)

Hello, Modal User from Modal!


## 3. Creating a Custom Environment

For machine learning, we need to create a custom environment with the required dependencies:

In [5]:
from modal import App, Image

# Create a custom image with ML dependencies
ml_image = Image.debian_slim().pip_install(
    "torch", 
    "torchvision", 
    "transformers",
    "scikit-learn",
    "pandas",
    "matplotlib",
    "numpy"
)

app = App("ml-environment", image=ml_image)

@app.function()
def check_versions():
    import torch
    import transformers
    import sklearn
    
    result = {
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "transformers": transformers.__version__,
        "sklearn": sklearn.__version__
    }
    
    return result

if __name__ == "__main__":
    with app.run():
        versions = check_versions.remote()
        print("Installed packages:")
        for package, version in versions.items():
            print(f"- {package}: {version}")

Installed packages:
- torch: 2.7.1+cu126
- cuda_available: False
- transformers: 4.53.1
- sklearn: 1.7.0

Note that `cuda_available` is `False` here because this function did not request a GPU. The next section shows how to attach one.


## 4. Adding GPU Acceleration

One of the most powerful features of Modal is the ability to easily access GPU hardware. Let's see how to configure a function to use a GPU:

In [6]:
from modal import App, Image

# Create an image with PyTorch (the default pip wheels bundle CUDA support)
gpu_image = Image.debian_slim().pip_install(
    "torch",
    "torchvision"
)

app = App("gpu-demo", image=gpu_image)

# Specify the GPU type as a string using the gpu parameter
# Options include "T4", "A10G", "A100", and "H100"
@app.function(gpu="T4")
def check_gpu():
    import torch
    
    print("CUDA available:", torch.cuda.is_available())
    
    if torch.cuda.is_available():
        print("CUDA Device:", torch.cuda.get_device_name(0))
        
        # Run a simple test on GPU
        x = torch.randn(1000, 1000).cuda()
        y = torch.randn(1000, 1000).cuda()
        
        # Measure time for matrix multiplication on GPU
        import time
        start_time = time.time()
        z = torch.matmul(x, y)
        torch.cuda.synchronize()
        gpu_time = time.time() - start_time
        
        # For comparison, do the same on CPU
        x_cpu = x.cpu()
        y_cpu = y.cpu()
        start_time = time.time()
        z_cpu = torch.matmul(x_cpu, y_cpu)
        cpu_time = time.time() - start_time
        
        return {
            "gpu_device": torch.cuda.get_device_name(0),
            "gpu_time": gpu_time,
            "cpu_time": cpu_time,
            "speedup_factor": cpu_time / gpu_time
        }
    
    return {"error": "CUDA not available"}

if __name__ == "__main__":
    with app.run():
        result = check_gpu.remote()
        print("\nTest results:")
        for key, value in result.items():
            if isinstance(value, float):
                print(f"- {key}: {value:.6f}")
            else:
                print(f"- {key}: {value}")

Note: the older `gpu=gpu.T4()` form is deprecated as of 2025-02-07. Modal now expects the GPU type as a string, for example `gpu="T4"`.



Test results:
- gpu_device: Tesla T4
- gpu_time: 0.049875
- cpu_time: 0.018355
- speedup_factor: 0.368020
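
In the results above the speedup factor is below 1, which means the GPU run was slower than the CPU run. A likely cause is that the first CUDA matmul pays one-time initialization costs, and a single 1000×1000 product is too small to amortize them. A minimal sketch of a fairer timing loop, with untimed warm-up calls and averaging over several iterations, might look like this. `benchmark` and its parameters are illustrative names, not part of Modal or PyTorch.

```python
import time

def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock seconds per call of fn().

    The warm-up calls run first and are not timed. If sync is given, it is
    called before each clock reading (e.g. torch.cuda.synchronize for GPU work).
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters

# Stand-in workload so the sketch runs anywhere. Inside check_gpu you could
# pass `lambda: torch.matmul(x, y)` together with `sync=torch.cuda.synchronize`.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(1000))), warmup=2, iters=5)
print(f"mean time per call: {mean_seconds:.6f}s over 5 timed calls")
```

Larger matrices (for example 8000×8000) also tend to show the GPU's advantage more clearly, since the fixed launch overhead becomes a smaller share of the total.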


## 5. Training a Deep Learning Model

Now, let's train a simple deep learning model using a GPU on Modal. We'll train a convolutional neural network (CNN) on the MNIST dataset:

In [None]:
import modal

# Create an image with deep learning dependencies
dl_image = modal.Image.debian_slim().pip_install(
    "torch",
    "torchvision",
    "tqdm",
    "matplotlib"
)

app = modal.App("mnist-training", image=dl_image)

# Persist outputs in a Modal Volume: files written to the container's own
# filesystem are discarded when the function exits.
# The volume name "mnist-outputs" is an arbitrary choice for this course.
outputs_volume = modal.Volume.from_name("mnist-outputs", create_if_missing=True)

@app.function(
    gpu="T4",
    volumes={"/root/outputs": outputs_volume}
)
def train_mnist_model(epochs=5, batch_size=64):
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    from tqdm import tqdm
    import matplotlib.pyplot as plt
    import os
    
    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    
    # Define a simple CNN model
    class SimpleCNN(nn.Module):
        def __init__(self):
            super(SimpleCNN, self).__init__()
            self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
            self.relu1 = nn.ReLU()
            self.pool1 = nn.MaxPool2d(kernel_size=2)
            self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
            self.relu2 = nn.ReLU()
            self.pool2 = nn.MaxPool2d(kernel_size=2)
            self.fc1 = nn.Linear(64 * 7 * 7, 128)
            self.relu3 = nn.ReLU()
            self.fc2 = nn.Linear(128, 10)
            
        def forward(self, x):
            x = self.pool1(self.relu1(self.conv1(x)))
            x = self.pool2(self.relu2(self.conv2(x)))
            x = x.view(-1, 64 * 7 * 7)
            x = self.relu3(self.fc1(x))
            x = self.fc2(x)
            return x
    
    # Load MNIST dataset
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    
    train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    # Initialize the model
    model = SimpleCNN().to(device)
    
    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    # Training loop
    train_losses = []
    test_accuracies = []
    
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        
        # Use tqdm for progress bar
        with tqdm(train_loader, unit="batch") as tepoch:
            for data, target in tepoch:
                tepoch.set_description(f"Epoch {epoch+1}/{epochs}")
                
                data, target = data.to(device), target.to(device)
                
                optimizer.zero_grad()
                output = model(data)
                loss = criterion(output, target)
                loss.backward()
                optimizer.step()
                
                train_loss += loss.item()
                tepoch.set_postfix(loss=loss.item())
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Evaluate on test set
        model.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()
        
        accuracy = 100 * correct / total
        test_accuracies.append(accuracy)
        
        print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss:.4f}, Test Accuracy: {accuracy:.2f}%")
    
    # Save the model
    output_dir = "/root/outputs"
    os.makedirs(output_dir, exist_ok=True)
    model_path = os.path.join(output_dir, "mnist_model.pth")
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")
    
    # Plot training progress
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(train_losses)
    plt.title("Training Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    
    plt.subplot(1, 2, 2)
    plt.plot(test_accuracies)
    plt.title("Test Accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy (%)")
    
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, "training_progress.png"))
    
    return {
        "final_train_loss": train_losses[-1],
        "final_test_accuracy": test_accuracies[-1],
        "model_path": model_path
    }

if __name__ == "__main__":
    with app.run():
        result = train_mnist_model.remote(epochs=5, batch_size=128)
        print("\nTraining complete!")
        print(f"Final training loss: {result['final_train_loss']:.4f}")
        print(f"Final test accuracy: {result['final_test_accuracy']:.2f}%")
        print(f"Model saved to: {result['model_path']}")
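
The `64 * 7 * 7` input size of `fc1` in `SimpleCNN` follows from how each layer changes the 28×28 MNIST image. As a quick check, the standard output-size formula for convolution and pooling layers can be applied layer by layer. `conv_out` is a helper written for this illustration, not a PyTorch function.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                               # MNIST images are 28x28
size = conv_out(size, kernel=3, stride=1, padding=1)    # conv1 keeps 28
size = conv_out(size, kernel=2, stride=2)               # pool1 halves to 14
size = conv_out(size, kernel=3, stride=1, padding=1)    # conv2 keeps 14
size = conv_out(size, kernel=2, stride=2)               # pool2 halves to 7

flat_features = 64 * size * size                        # 64 channels out of conv2
print(size, flat_features)                              # prints: 7 3136
```

If you change the kernel sizes, padding, or input resolution, rerun the same arithmetic to find the new input size for the first linear layer.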

## 6. Fine-tuning a Pretrained Model

For more advanced applications, let's fine-tune a pretrained model on a different dataset. We'll take a pretrained Vision Transformer (ViT) and fine-tune it on CIFAR-10:

In [None]:
import modal

# Create an image with Hugging Face Transformers and related dependencies
finetune_image = modal.Image.debian_slim().pip_install(
    "torch",
    "torchvision",
    "transformers",
    "datasets",
    "accelerate",
    "evaluate",
    "scikit-learn",  # needed by the "accuracy" metric loaded through evaluate
    "matplotlib"
)

app = modal.App("vit-finetune", image=finetune_image)

# Persist checkpoints and plots in a Modal Volume (the name is an arbitrary choice).
# A local-directory mount would only copy files into the container, not back out,
# and __file__ is not defined inside a notebook.
outputs_volume = modal.Volume.from_name("vit-finetune-outputs", create_if_missing=True)

@app.function(
    gpu="T4",
    timeout=3600,  # Allow up to 1 hour for fine-tuning
    volumes={"/root/outputs": outputs_volume}
)
def finetune_vit(batch_size=32, num_epochs=3):
    import torch
    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader
    from transformers import ViTForImageClassification, ViTImageProcessor
    from transformers import TrainingArguments, Trainer
    from datasets import load_dataset  # load_metric was removed from datasets; evaluate replaces it
    import numpy as np
    import evaluate
    import matplotlib.pyplot as plt
    import os
    
    print("Setting up fine-tuning process...")
    
    # Set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    
    # Load CIFAR-10 dataset using Hugging Face datasets
    cifar10 = load_dataset("cifar10")
    
    # Get the labels
    labels = cifar10["train"].features["label"].names
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for i, label in enumerate(labels)}
    
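    # Load pre-trained ViT model and processor
    model_name = "google/vit-base-patch16-224"
    processor = ViTImageProcessor.from_pretrained(model_name)
    model = ViTForImageClassification.from_pretrained(
        model_name,
        num_labels=10,
        label2id=label2id,
        id2label=id2label,
        # The checkpoint ships a 1000-class ImageNet head; allow it to be
        # replaced by a freshly initialized 10-class head
        ignore_mismatched_sizes=True
    )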
    
    # Define image transformations
    def transform_images(examples):
        images = [img.convert("RGB") for img in examples["img"]]
        processed_images = processor(images, return_tensors="pt")
        examples["pixel_values"] = processed_images["pixel_values"]
        return examples
    
    # Apply transformations to the dataset
    transformed_cifar10 = cifar10.with_transform(transform_images)
    
    # Define evaluation metrics
    accuracy_metric = evaluate.load("accuracy")
    
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return accuracy_metric.compute(predictions=predictions, references=labels)
    
    # Define training arguments
    output_dir = "/root/outputs/vit-cifar10"
    os.makedirs(output_dir, exist_ok=True)
    
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
        num_train_epochs=num_epochs,
        fp16=True,  # Use mixed precision training
        save_strategy="epoch",
        learning_rate=5e-5,
        remove_unused_columns=False,
        push_to_hub=False,
        report_to="none",
        load_best_model_at_end=True,
    )
    
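    # Batch only the tensors the model needs; the raw PIL "img" column
    # cannot be turned into a tensor by the default collator
    def collate_fn(batch):
        return {
            "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
            "labels": torch.tensor([x["label"] for x in batch]),
        }

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=transformed_cifar10["train"],
        eval_dataset=transformed_cifar10["test"],
        data_collator=collate_fn,
        compute_metrics=compute_metrics,
    )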
    
    # Fine-tune the model
    print("Starting fine-tuning...")
    trainer.train()
    
    # Evaluate the model
    eval_results = trainer.evaluate()
    print(f"Evaluation results: {eval_results}")
    
    # Save the fine-tuned model
    model_save_path = os.path.join(output_dir, "final_model")
    trainer.save_model(model_save_path)
    processor.save_pretrained(model_save_path)
    
    # Test the model on a few examples and create visualizations
    def predict_and_visualize(dataset, num_samples=5):
        samples = dataset.shuffle(seed=42).select(range(num_samples))
        images = [sample["img"] for sample in samples]
        labels = [sample["label"] for sample in samples]
        
        # Process images and run prediction
        inputs = processor(images, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        
        preds = outputs.logits.argmax(dim=-1).cpu().numpy()
        
        # Plot results
        fig, axes = plt.subplots(1, num_samples, figsize=(15, 3))
        for i, (image, pred, label) in enumerate(zip(images, preds, labels)):
            axes[i].imshow(image)
            axes[i].set_title(f"Pred: {id2label[pred]}\nTrue: {id2label[label]}")
            axes[i].axis("off")
            
        plt.tight_layout()
        plt.savefig(os.path.join(output_dir, "predictions.png"))
    
    print("Creating visualizations...")
    predict_and_visualize(transformed_cifar10["test"])
    
    return {
        "accuracy": eval_results["eval_accuracy"],
        "model_path": model_save_path,
        "visualizations_path": os.path.join(output_dir, "predictions.png")
    }

if __name__ == "__main__":
    with app.run():
        print("Starting fine-tuning process...")
        result = finetune_vit.remote(batch_size=16, num_epochs=3)
        print("\nFine-tuning complete!")
        print(f"Final accuracy: {result['accuracy']:.4f}")
        print(f"Model saved to: {result['model_path']}")
        print(f"Visualizations saved to: {result['visualizations_path']}")
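
Before launching a run like this, it can help to check whether the job plausibly fits inside the function's `timeout`. The sketch below only does the arithmetic: it counts optimizer steps and reports the throughput needed to finish in time. `training_budget` is a helper written for this illustration, and the estimate ignores evaluation, checkpointing, and startup time.

```python
import math

def training_budget(num_examples, batch_size, num_epochs, timeout_seconds):
    """Return (steps_per_epoch, total_steps, min_steps_per_second)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return steps_per_epoch, total_steps, total_steps / timeout_seconds

# CIFAR-10 has 50,000 training images; the other values match the call above.
per_epoch, total, needed = training_budget(
    50_000, batch_size=16, num_epochs=3, timeout_seconds=3600
)
print(per_epoch, total, round(needed, 2))   # prints: 3125 9375 2.6
```

If the throughput you measure in the first few hundred steps is below the required rate, consider raising `timeout`, reducing `num_epochs`, or training on a subset of the data.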

## 7. Distributed Training with Modal

Distributed training with PyTorch's DistributedDataParallel (DDP) runs one process per GPU, and all processes must be able to reach one another. A straightforward way to arrange that on Modal is to attach several GPUs to a single container and start one DDP process per GPU inside it:

In [None]:
import modal

# Create an image with PyTorch and related dependencies
distributed_image = modal.Image.debian_slim().pip_install(
    "torch>=1.9.0",
    "torchvision",
    "tqdm"
)

app = modal.App("distributed-training", image=distributed_image)

# Persist the trained model in a Modal Volume (the name is an arbitrary choice)
outputs_volume = modal.Volume.from_name("distributed-outputs", create_if_missing=True)
OUTPUT_DIR = "/root/outputs/distributed"

def distributed_trainer(rank, world_size, batch_size=64, epochs=5):
    """
    Worker that runs in its own process, one per GPU, inside a single container.
    """
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torchvision import datasets, transforms
    from tqdm import tqdm
    import json
    import os
    
    # Initialize the distributed process group
    # (MASTER_ADDR and MASTER_PORT are set by the coordinator below)
    dist.init_process_group(
        backend="nccl",  # Use NCCL for GPU training
        init_method="env://",
        world_size=world_size,
        rank=rank
    )
    
    # Each process drives the GPU whose index matches its rank
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)
    
    # Define a CNN model
    class CNN(nn.Module):
        def __init__(self):
            super(CNN, self).__init__()
            self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
            self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
            self.fc1 = nn.Linear(64 * 7 * 7, 128)
            self.fc2 = nn.Linear(128, 10)
            self.relu = nn.ReLU()
            
        def forward(self, x):
            x = self.pool(self.relu(self.conv1(x)))
            x = self.pool(self.relu(self.conv2(x)))
            x = x.view(-1, 64 * 7 * 7)
            x = self.relu(self.fc1(x))
            x = self.fc2(x)
            return x
    
    # Create model and move it to this process's GPU
    model = CNN().to(device)
    
    # Wrap the model with DDP
    model = DDP(model, device_ids=[rank])
    
    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    # Define transforms
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    
    # Load the MNIST dataset (already downloaded by the coordinator)
    train_dataset = datasets.MNIST('./data', train=True, download=False, transform=transform)
    test_dataset = datasets.MNIST('./data', train=False, download=False, transform=transform)
    
    # Create distributed sampler so each rank sees a different shard
    train_sampler = DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank
    )
    
    # Create data loaders
    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=batch_size,
        sampler=train_sampler
    )
    
    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=batch_size
    )
    
    # Training loop
    accuracy = None
    for epoch in range(epochs):
        model.train()
        train_sampler.set_epoch(epoch)  # Reshuffle differently in every epoch
        running_loss = 0.0
        
        if rank == 0:
            pbar = tqdm(total=len(train_loader), desc=f"Epoch {epoch+1}/{epochs}")
        
        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            
            if rank == 0 and i % 100 == 99:
                pbar.update(100)
                pbar.set_postfix(loss=running_loss/100)
                running_loss = 0.0
                
        if rank == 0:
            pbar.close()
            
            # Evaluate on rank 0 only, using the underlying module so that
            # no cross-rank communication is triggered
            model.eval()
            correct = 0
            total = 0
            
            with torch.no_grad():
                for inputs, labels in test_loader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    outputs = model.module(inputs)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
                    
            accuracy = 100 * correct / total
            print(f"Epoch {epoch+1}, Accuracy: {accuracy:.2f}%")
            
    # Save the model and final metrics (only on rank 0)
    if rank == 0:
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        torch.save(model.module.state_dict(), os.path.join(OUTPUT_DIR, "distributed_model.pth"))
        with open(os.path.join(OUTPUT_DIR, "metrics.json"), "w") as f:
            json.dump({"accuracy": accuracy}, f)
        print("Model saved!")
    
    # Wait for every rank, then clean up the process group
    dist.barrier()
    dist.destroy_process_group()

@app.function(
    gpu="T4:2",  # Two GPUs in one container; keep this count in sync with num_gpus
    volumes={"/root/outputs": outputs_volume}
)
def run_distributed_training(num_gpus=2, batch_size=64, epochs=5):
    """
    Coordinator that starts one training process per GPU and collects the result.
    """
    import json
    import os
    import torch.multiprocessing as mp
    from torchvision import datasets
    
    # All processes share one container, so they can rendezvous over localhost
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    
    # Download MNIST once up front so the workers don't race on the same files
    datasets.MNIST('./data', train=True, download=True)
    datasets.MNIST('./data', train=False, download=True)
    
    print(f"Starting distributed training with {num_gpus} GPUs...")
    
    # Start one process per GPU; each receives its rank as the first argument.
    # "fork" lets the workers inherit distributed_trainer without pickling it.
    # CUDA must not be initialized in this parent process before forking.
    mp.start_processes(
        distributed_trainer,
        args=(num_gpus, batch_size, epochs),
        nprocs=num_gpus,
        join=True,
        start_method="fork"
    )
    
    # Read back the metrics written by rank 0
    with open(os.path.join(OUTPUT_DIR, "metrics.json")) as f:
        return json.load(f)

if __name__ == "__main__":
    with app.run():
        result = run_distributed_training.remote(num_gpus=2, batch_size=64, epochs=3)
        print("\nDistributed training complete!")
        print(f"Final accuracy: {result['accuracy']:.2f}%")
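
To make the role of `DistributedSampler` more concrete, the sketch below mimics in plain Python how it splits dataset indices across ranks when shuffling is turned off. With shuffling on (the default), the indices are first permuted using a seed derived from the epoch, which is why the training loop calls `set_epoch`. `shard_indices` is a helper written for this illustration, not the PyTorch implementation.

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    """Indices one rank would see, mimicking DistributedSampler without shuffling."""
    num_samples = math.ceil(dataset_len / num_replicas)   # per-rank share
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by repeating leading indices so every rank gets the same count
    indices += indices[: total_size - dataset_len]
    # Each rank takes every num_replicas-th index, starting at its own rank
    return indices[rank:total_size:num_replicas]

shards = [shard_indices(10, num_replicas=4, rank=r) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Every rank receives the same number of samples, which keeps the ranks in step during gradient synchronization. The price is that a few samples are seen twice when the dataset size is not divisible by the number of ranks.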

## 8. Deploying Trained Models as Endpoints

Once you've trained your models, you can easily deploy them as API endpoints using Modal. This allows your models to be accessible via HTTP requests, which is perfect for integrating them into other applications.

In [None]:
import modal

# Create an image for model serving
serve_image = modal.Image.debian_slim().pip_install(
    "torch",
    "torchvision",
    "fastapi",
    "pillow",
    "python-multipart"
)

app = modal.App("model-serving", image=serve_image)

# Volume holding the trained weights (mnist_model.pth). The name is an assumption:
# use whichever Volume your training run saved the model to.
models_volume = modal.Volume.from_name("mnist-outputs", create_if_missing=True)

# Define the model class and loading function
@app.cls(
    gpu="T4",
    volumes={"/root/models": models_volume}
)
class ModelService:
    @modal.enter()
    def load_model(self):
        # Runs once when the container starts, before any requests are handled
        import torch
        import torchvision.transforms as transforms
        
        # Model definition (same as before)
        class SimpleCNN(torch.nn.Module):
            def __init__(self):
                super(SimpleCNN, self).__init__()
                self.conv1 = torch.nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
                self.relu1 = torch.nn.ReLU()
                self.pool1 = torch.nn.MaxPool2d(kernel_size=2)
                self.conv2 = torch.nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
                self.relu2 = torch.nn.ReLU()
                self.pool2 = torch.nn.MaxPool2d(kernel_size=2)
                self.fc1 = torch.nn.Linear(64 * 7 * 7, 128)
                self.relu3 = torch.nn.ReLU()
                self.fc2 = torch.nn.Linear(128, 10)
                
            def forward(self, x):
                x = self.pool1(self.relu1(self.conv1(x)))
                x = self.pool2(self.relu2(self.conv2(x)))
                x = x.view(-1, 64 * 7 * 7)
                x = self.relu3(self.fc1(x))
                x = self.fc2(x)
                return x
        
        # Load the model
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")
        
        self.model = SimpleCNN().to(self.device)
        
        # Try to load from either of the potential locations
        model_paths = [
            "/root/models/mnist_model.pth",
            "/root/outputs/mnist_model.pth"
        ]
        
        model_loaded = False
        for path in model_paths:
            try:
                self.model.load_state_dict(torch.load(path, map_location=self.device))
                print(f"Model loaded from {path}")
                model_loaded = True
                break
            except Exception as e:
                print(f"Could not load from {path}: {e}")
        
        if not model_loaded:
            print("WARNING: Could not load pre-trained model. Using untrained model!")
        
        self.model.eval()
        
        # Define image transformations
        self.transform = transforms.Compose([
            transforms.Grayscale(),
            transforms.Resize((28, 28)),
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])
        
    @modal.method()
    def predict(self, image_bytes):
        import torch
        import io
        from PIL import Image
        
        # Load image from bytes
        image = Image.open(io.BytesIO(image_bytes))
        
        # Transform image
        image_tensor = self.transform(image).unsqueeze(0).to(self.device)
        
        # Make prediction
        with torch.no_grad():
            outputs = self.model(image_tensor)
            probs = torch.nn.functional.softmax(outputs, dim=1)[0]
            predicted_class = torch.argmax(outputs, dim=1).item()
            
        # Get the top 3 predictions
        top_probs, top_classes = torch.topk(probs, 3)
        
        results = [
            {"class": int(cls), "probability": float(prob)}
            for cls, prob in zip(top_classes.cpu().numpy(), top_probs.cpu().numpy())
        ]
        
        return {
            "predicted_class": predicted_class,
            "top_predictions": results
        }

# Serve a FastAPI app from Modal. FastAPI is imported inside the function so
# that it only needs to be installed in the container image, not locally.
@app.function()
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI, File, UploadFile

    web_app = FastAPI()

    @web_app.post("/predict")
    async def predict(file: UploadFile = File(...)):
        # Read image file
        image_bytes = await file.read()
        
        # Forward the bytes to the GPU-backed model service
        return await ModelService().predict.remote.aio(image_bytes)

    return web_app

if __name__ == "__main__":
    print("Starting model serving app...")
    print("Deploy this app with: modal deploy modal_serving.py")
    print("Or run locally with: modal serve modal_serving.py")
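
Once the app is deployed, a client sends an image to the `/predict` route as a multipart form upload in a field named `file`. The sketch below builds such a request with only the standard library and stops short of sending it. The URL and image bytes are placeholders, and `build_multipart` and `build_predict_request` are helpers written for this illustration.

```python
import urllib.request
import uuid

def build_multipart(field, filename, data, boundary=None):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"

def build_predict_request(url, image_bytes):
    """Prepare (but do not send) a POST to the /predict route."""
    body, content_type = build_multipart("file", "digit.png", image_bytes)
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )

# Placeholder URL and bytes: use the URL Modal prints when you deploy,
# and the contents of a real image file.
request = build_predict_request("https://example.invalid/predict", b"\x89PNG...")
print(request.get_method(), request.full_url, len(request.data), "bytes")
# To send it: json.load(urllib.request.urlopen(request))
```

The field name must match the `file` parameter of the FastAPI handler, otherwise FastAPI rejects the request with a validation error.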

## 9. Best Practices for Modal GPU Training

When using Modal for GPU-accelerated model training, consider the following best practices to optimize performance and cost:

### 1. Data Management

- **Use data caching**: For large datasets, use Modal volumes to cache preprocessed data
- **Optimize data loading**: Use proper batch sizes and data loaders
- **Data preprocessing**: Do heavy preprocessing once and save the results
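
The "preprocess once, reuse afterwards" idea can be sketched as a small disk cache. On Modal the cache directory would typically be a path inside a mounted Volume; here a temporary directory stands in so the example runs anywhere. `cached` and `preprocess` are illustrative names, and the preprocessing itself is a stand-in.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def cached(cache_dir, key, compute):
    """Return compute()'s result, reusing a pickled copy in cache_dir if present."""
    name = hashlib.sha256(key.encode()).hexdigest()[:16] + ".pkl"
    path = Path(cache_dir) / name
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(result))
    return result

calls = []

def preprocess():
    calls.append(1)                        # count how often the heavy step runs
    return [x / 255 for x in range(256)]   # stand-in for real preprocessing

with tempfile.TemporaryDirectory() as cache_dir:   # on Modal: a Volume mount path
    first = cached(cache_dir, "mnist-normalized-v1", preprocess)
    second = cached(cache_dir, "mnist-normalized-v1", preprocess)

print(len(calls), first == second)   # prints: 1 True
```

Including a version tag in the key (such as `-v1`) gives you a simple way to invalidate the cache when the preprocessing logic changes.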

### 2. GPU Efficiency

- **Choose the right GPU**: Select the GPU type based on your workload needs:
  - T4: Good for smaller models, cost-effective
  - A10G: Better for medium-sized models
  - A100: For large models and faster training
  - H100: For the most demanding workloads

- **Optimize batch size**: Find the largest batch size that fits in GPU memory

- **Use mixed precision training**: FP16 (or BF16 on GPUs that support it) can significantly speed up training and reduce memory use
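
One way to find the largest batch size that fits is to double the size until a trial fails, then binary-search between the last success and the first failure. The sketch below assumes a caller-supplied `fits(batch_size)` predicate. In practice it might run one forward and backward pass and return `False` on an out-of-memory error; here a stand-in predicate is used so the example runs without a GPU. All names are illustrative.

```python
def largest_fitting_batch_size(fits, start=1, limit=4096):
    """Largest batch size in [start, limit] for which fits(batch_size) is True.

    Assumes fits is monotonic: if a size fits, every smaller size fits too.
    Returns 0 if even `start` does not fit.
    """
    if not fits(start):
        return 0
    low = start
    # Grow by doubling until a size fails or the limit is reached
    while low * 2 <= limit and fits(low * 2):
        low *= 2
    high = min(low * 2, limit + 1)   # smallest size that failed or is out of range
    # Binary search between the last success and that upper bound
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low

# Stand-in predicate: pretend anything up to 300 samples fits in memory
best = largest_fitting_batch_size(lambda batch_size: batch_size <= 300)
print(best)   # prints: 300
```

The largest size that fits is not always the best choice for convergence, so treat the result as an upper bound rather than a recommendation.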

### 3. Cost Optimization

- **Monitor usage**: Keep track of your GPU usage
- **Checkpoint models**: Save checkpoints to resume training
- **Use spot instances**: For non-critical workloads, use spot instances for cost savings
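
Checkpointing pays off when a job can pick up where it left off after a timeout or interruption. The toy loop below shows the resume logic with a JSON file. A real training job would store model and optimizer state (for example with `torch.save`) in a Modal Volume instead. `train_with_checkpoints` and the fake loss values are purely illustrative.

```python
import json
import tempfile
from pathlib import Path

def train_with_checkpoints(ckpt_path, total_epochs, stop_after=None):
    """Run (or resume) a toy training loop, checkpointing after every epoch.

    stop_after simulates an interruption after that many epochs in this call.
    """
    ckpt_path = Path(ckpt_path)
    state = {"epoch": 0, "loss_history": []}
    if ckpt_path.exists():                       # resume from the last checkpoint
        state = json.loads(ckpt_path.read_text())
    done_this_call = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and done_this_call >= stop_after:
            break                                # simulated timeout or interruption
        state["epoch"] += 1
        state["loss_history"].append(1.0 / state["epoch"])   # stand-in for a real epoch
        ckpt_path.write_text(json.dumps(state))  # persist progress
        done_this_call += 1
    return state

with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "checkpoint.json"
    first = train_with_checkpoints(ckpt, total_epochs=5, stop_after=2)
    resumed = train_with_checkpoints(ckpt, total_epochs=5)

print(first["epoch"], resumed["epoch"], len(resumed["loss_history"]))   # prints: 2 5 5
```

The second call skips the two epochs that were already completed and only runs the remaining three.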

### 4. Development Workflow

- **Test locally first**: Develop and test with small datasets locally
- **Use smaller models**: During development, use smaller model variants
- **Debug efficiently**: Use Modal's logs and metrics for debugging

## Conclusion

Modal provides a powerful and flexible platform for GPU-accelerated AI model training and deployment. Its key advantages include:

- Easy access to powerful GPUs without infrastructure management
- Simple scaling from local development to production
- Cost-effective pay-per-use pricing model
- Streamlined deployment of trained models

By following this course, you've learned how to:
- Set up Modal for machine learning workflows
- Use GPUs for accelerated training
- Train and fine-tune deep learning models on Modal
- Implement distributed training across multiple GPUs
- Deploy trained models as API endpoints

To learn more, check out the [Modal documentation](https://modal.com/docs) and explore the [examples repository](https://github.com/modal-labs/modal-examples).